Commit Graph

795 Commits

Author SHA1 Message Date
Santhosh Kumar T
7dd9d1f3f2
Flashrom refactoring (#6922)
#### Why I did it
To build flashrom properly with dependency tracking.

#### How I did it
Moved flashrom code from platform/broadcom/sonic-platform-modules-dell/tools directory to src/flashrom directory.
At the end, flashrom_0.9.7_amd64.deb package is build which will be installed in the devices.
2021-04-20 15:24:44 -07:00
Samuel Angebault
96690faa5b
[Arista] Fix dockerd issue on Arista platforms (#7376)
Why I did it
Recent systemd upgrade from #7228 requires an extra cmdline parameter for dockerd to start properly.
Updating boot0 was missed as part of the systemd upgrade change.

How I did it
Just added the missing cmdline parameter in files/Aboot/boot0.j2
This change fixes #7372

How to verify it
Boot the image and dockerd should start normally.
2021-04-20 14:55:14 -07:00
Kuanyu Chen
01f2b5f250
[config-setup]: Fix a bug in checking if updategraph is enabled (#7093)
Encounter error during "config-setup boot" if the updategraph is enabled.

How I did it
Correct the code inside the config-setup script.
Remove the space between the assignment operator.

How to verify it
Remove the /etc/sonic/config_db.json and reboot the device.
Originally, it will return following error after boot up.
rv: command not found
After modification, it can correctly parse the status of updategraph without error.
2021-04-19 11:40:52 -07:00
guxianghong
6fe6d7394d
[arm] support compile sonic arm image on arm server (#7285)
- Support compile sonic arm image on arm server. If arm image compiling is executed on arm server instead of using qemu mode on x86 server, compile time can be saved significantly.
- Add kernel argument systemd.unified_cgroup_hierarchy=0 for upgrade systemd to version 247, according to #7228
- rename multiarch docker to sonic-slave-${distro}-march-${arch}

Co-authored-by: Xianghong Gu <xgu@centecnetworks.com>
Co-authored-by: Shi Lei <shil@centecnetworks.com>
2021-04-18 08:17:57 -07:00
Stepan Blyshchak
4369361894
[sonic_debian_extension.j2] fix systemd version not from buster-backports (#7322)
Install systemd explicitelly from backports and install libsystemd* packages from backports.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-04-18 08:07:02 -07:00
jmmikkel
43342b33b8
[chassis] Add templates and code to support VoQ chassis iBGP peers (#5622)
This commit has following changes:

* Add templates and code to support VoQ chassis iBGP peers

* Add support to convert a new VoQChassisInternal element in the
   BGPSession element of the minigraph to a new BGP_VOQ_CHASSIS_NEIGHBOR 
   table in CONFIG_DB.
* Add a new set of "voq_chassis" templates to docker-fpm-frr
* Add a new BGP peer manager to bgpcfgd to add neighbors from the
  BGP_VOQ_CHASSIS_NEIGHBOR table using the voq_chassis templates.
* Add a test case for minigraph.py, making sure the VoQChassisInternal
  element creates a BGP_VOQ_CHASSIS_NEIGHBOR entry, but not if its
  value is "false".
* Add a set of test cases for the new voq_chassis templates in
  sonic-bgpcfgd tests.

Note that the templates expect the new
"bgp bestpath peer-type multipath-relax" bgpd configuration to be
available.

Signed-off-by: Joanne Mikkelson <jmmikkel@arista.com>
2021-04-16 11:11:32 -07:00
yozhao101
2737c9681f
[container_checker] Exclude the 'always_disabled' container from expected running container list (#7217)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
Since we introduced a new value always_disabled for the state field in FEATURE table, the expected running container list
should exclude the always_diabled containers. This bug was found by nightly test and posted at here: issue. This PR fixes #7210.

How I did it
I added a logic condition to decide whether the value of state field of a container was always_disabled or not.

How to verify it
I verified this on the device str-dx010-acs-1.

Which release branch to backport (provide reason below if selected)
 201811
 201911
 202006
[ x] 202012
2021-04-02 08:05:46 -07:00
vganesan-nokia
b313d4d092
[systemlag] Lag id boundary set for system lag (#6488)
Signed-off-by: vedganes <vedavinayagam.ganesan@nokia.com>

Changes for setting platfrom specific lag id boundary id in the chassis
app db. The platfrom specific lag id boundaries are supplied via
chassisdb.conf. The lag_id_start and lag_id_end boundary values sourced
from this file are set in chassis app db which will be used by lag id
allocator to allocate unique lag id in atomic fashion
2021-03-30 23:21:53 -07:00
Stepan Blyshchak
1f7d9e2698
[docker_img_ctl.j2] make tmpfs mounts optional and add ability to run container by image id (#6439)
- Why I did it
I made the docker_img_ctl.j2 applicable for more dockers (including application extensions dockers) by adding an option not to mount tmpfs on /tmp/ and /var/tmp/. In some applications /tmp/ is a different docker volume which can't be tmpfs.
Also, I added and ability to pass REPO[:TAG]|[@digest]/IMAGE_ID instead of just REPO name.

- How I did it
Modified docker_img_ctl.j2 and docker makefiles.

- How to verify it
Run it on the switch.
2021-03-16 17:03:12 +02:00
Stepan Blyshchak
2b8941e716
[sonic_debian_extension] add docker script to SONiC filesystem (#5935)
- Why I did it
To allow SONiC Package Migration during SONiC-2-SONiC upgrade we need to start docker daemon in chroot-ed environment in new SONiC filesystem.
Later this script will be used to start dockerd in chroot environment on SONiC

- How I did it
Install a docker service script into /usr/lib/docker/ in SONiC filesystem.

- How to verify it
Install SONiC image on the switch, mount squashfs to some directory, mount overlay rw layer over squashfs, mount procfs and sysfs, mount docker library. Start the docker using:
root@sonic:~$ /usr/lib/docker/docker.sh start

Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
2021-03-14 14:15:42 +02:00
Tamer Ahmed
51ab39fcb2
[hostcfgd]: Add Ability To Configure Feature During Run-time (#6700)
Features may be enabled/disabled for the same topology based on run-time
configuration. This PR adds the ability to enable/disable feature based
on config db data.

signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
2021-03-13 05:56:27 -08:00
Renuka Manavalan
6f7cd8d772
Copy dummy flannel.conf to get around absence of CNI Network (#6985)
Why I did it
We skip install of CNI plugin, as we don't need. But this leaves node in "not ready" state, upon joining master.
To fix, we copy this dummy .conf file in /etc/cni/net.d

How I did it
Keep this file in /usr/share/sonic/templates and copy to /etc/cni/net.d upon joining k8s master.

How to verify it
Upon configuring master-IP and enable join, watch node join and move to ready state.
You may verify using kubectl get nodes command
2021-03-09 19:49:54 -08:00
yozhao101
21f5e1280d
[Supervisord] Deduplicate the alerting messages of critical processes from Supervisord. (#6849)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
In the configuration of rsyslog, duplicate messages will be suppressed and reported in the format of message repeated n times.
Due to this behavior, if a critical process in a container exited unexpectedly, the alerting message will be written into syslog once
and not be written into syslog anymore until the second critical process exited. This PR aims to differentiate these alerting messages such that they will not be suppressed by rsyslogd and can appear in the syslog periodically.

How I did it
This PR adds a counter into the alerting message and shows how many minutes a critical process was not running.

How to verify it
I verified and test this implementation on a physical DUT.
2021-02-25 14:35:29 -08:00
Stepan Blyshchak
12c03c4f25
[sonic_debian_exntesion] install docker_image_ctl.j2 template in the image templates (#5937)
SONiC Package Manager will require to auto-generate the start script using that template. For that, we need this template to be recorded in SONiC filesystem.

Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
2021-02-25 09:11:12 -08:00
Stepan Blyshchak
e179ec2fae
[services] introduce sonic.target (#5705)
- Why I did it
Group all SONiC services together and able to manage them together. Will be used in config reload command as much simpler and generic way to restart services.

- How I did it
Add services to sonic.target

- How to verify it
Together with Azure/sonic-utilities#1199
config reload -y

Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
2021-02-25 14:26:24 +02:00
Ze Gan
4068944202
[MACsec]: Set MACsec feature to be auto-start (#6678)
1. Add supervisord as the entrypoint of docker-macsec
2. Add wpa_supplicant conf into docker-macsec
3. Set the macsecmgrd as the critical_process
4. Configure supervisor to monitor macsecmgrd
5. Set macsec in the features list
6. Add config variable `INCLUDE_MACSEC`
7. Add macsec.service

**- How to verify it**

Change the `/etc/sonic/config_db.json` as follow
```
{
    "PORT": {
        "Ethernet0": {
            ...
            "macsec": "test"
         }
    }
    ...
    "MACSEC_PROFILE": {
        "test": {
            "priority": 64,
            "cipher_suite": "GCM-AES-128",
            "primary_cak": "0123456789ABCDEF0123456789ABCDEF",
            "primary_ckn": "6162636465666768696A6B6C6D6E6F707172737475767778797A303132333435",
            "policy": "security"
        }
    }
}
```
To execute `sudo config reload -y`, We should find the following new items were inserted in app_db of redis
```
127.0.0.1:6379> keys *MAC*
1) "MACSEC_EGRESS_SC_TABLE:Ethernet0:72152375678227538"
2) "MACSEC_PORT_TABLE:Ethernet0"
127.0.0.1:6379> hgetall "MACSEC_EGRESS_SC_TABLE:Ethernet0:72152375678227538"
1) "ssci"
2) ""
3) "encoding_an"
4) "0"
127.0.0.1:6379> hgetall "MACSEC_PORT_TABLE:Ethernet0"
 1) "enable"
 2) "false"
 3) "cipher_suite"
 4) "GCM-AES-128"
 5) "enable_protect"
 6) "true"
 7) "enable_encrypt"
 8) "true"
 9) "enable_replay_protect"
10) "false"
11) "replay_window"
12) "0"
```

Signed-off-by: Ze Gan <ganze718@gmail.com>
2021-02-23 13:22:45 -08:00
arlakshm
f77157f09d
[baseimage] add ipintutil in sudoer file (#6845)
show ip interfaces is enhanced recently to support multi ASIC platforms in this PR- https://github.com/Azure/sonic-utilities/pull/1396 .
The ipintutil script as to run as sudo user, to get the ip interface from each namespace.
Add this script to the sudoer file so that show ip interface command is available for user with read-only permissions

Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2021-02-22 23:34:28 -08:00
Sujin Kang
d5238ae8dd
[pcie.yaml] Move pcie configuration file path to platform directory (#6475)
- Why I did it
The pcie configuration file location is under plugin directory not under platform directory.
#6437

- How I did it

Move all pcie.yaml configuration file from plugin to platform directory.
Remove unnecessary timer to start pcie-check.service
Move pcie-check.service to sonic-host-services
- How to verify it
Verify on the device
2021-02-21 08:27:37 -08:00
Samuel Angebault
5fb374b03d
[Arista] Driver and platform update (#6468)
- Add support for `DCS-7050SX3-48YC8` and `DCS-7050SX3-48C8` platform
 - Add support for more variants of `DCS-7280CR3-32[PD]4`
 - Add Supervisor to Linecard consutil support
 - Complete Watchdog platform API support
 - Fix some PSU behavior on `DCS-7050QX-32` and `DCS-7060CX-32S`
 - Fix SEU management on `DCS-7060CX-32S`
 - Allow kernel modules to build up to linux 5.10
 - Rename led color `orange` to `amber`
 - Miscellaneous fixes
2021-02-19 10:48:52 -08:00
SuvarnaMeenakshi
5a49a0f499
[multi-asic][vs]: Update topology script to retrieve hwsku from minigraph (#6219)
Update topology script to retrieve hwsku from minigraph
if hwsku information is not available in config_db.
Fix clean up of interfaces in msft_multi_asic_vs hwsku
topology script.
- Why I did it
When bringing up multi-asic VS switch, topology service is started during boot up.
Topology service starts a shell script which runs the topology script present in /usr/share/sonic/device// directory. To invoke hwsku specific script, the topology script tries to retrieve hwsku information from config_db.
During initial boot up config_db might not be populated. In order to start topology service before config_db is updated,
update topology script to get hwsku information from minigraph.xml if it is available.
This will be helpful to bring up multi-asic VS testbed by loading minigraph and starting topology service.
- How I did it
Update topology.sh script to retrieve hwsku information from minigraph.xml.
Fix clean up function on msft_multi_asic_vs toplogy script.
- How to verify it
single-asic VS - no change; topology service is only enabled for multi-asic VS.
multi-asic VS - Bring up multi-asic VS image, copy minigraph to vs image, start topology service. Topology service should be successful.
to test clean up function fix, start topology service - make sure interfaces are created and moved to the right namespaces.
stop topology service - make sure namespace do not have any interface and all front end interfaces are present in default namespace.
2021-02-18 22:02:29 -08:00
xumia
2ef5bd2e90
Add mirrors for reproducible build (#6813) 2021-02-18 14:59:52 +08:00
shlomibitton
f6bee7306e
Stop teamd service before syncd (#6755)
- What I did
All SWSS dependent services should stop before SWSS service to avoid future possible issues.
For example 'teamd' service will stop before to allow the driver unload netdev gracefully.
This is to stop all LAG's before restarting syncd service when running 'config reload' command.

- How I did it
Change the order of dependent services of SWSS.

- How to verify it
Run 'config reload' command.
Previously the operation failed when a large number of PortChannel configured on the system.

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2021-02-15 16:05:34 +02:00
Lawrence Lee
97c605f1f7
[swss]: Clear MUX-related state DB tables on start (#6759)
* Add *MUX_CABLE_TABLE* to set of tables to clear on SWSS start, which
will clear HW_MUX_CABLE_TABLE and MUX_CABLE_TABLE
* Order swss to start before pmon to ensure that DBs are cleared before
xcvrd (running inside pmon) starts and re-populates the tables

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-02-14 12:43:49 -08:00
dflynn-Nokia
88961f1339
[armhf build] Fix azure-storage dependency on cryptography package (#6780)
Fix marvell-armhf build break

The azure-storage package depends on the cryptography package. Newer
versions of cryptography require the rust compiler, the correct version
for which is not readily available in buster. Hence we pre-install an
older version here to satisfy the azure-storage dependency.
Note: This is not a problem for other architectures as pre-built versions
of cryptography are available for those. This sequence can be removed
after upgrading to debian bullseye.
2021-02-14 10:36:04 -08:00
Lior Avramov
6f8c31554f
[systemd] Increase syncd startup script timeout to support FW upgrade on init. (#6709)
**- Why I did it**
To support FW upgrade on init.

**- How I did it**
Change timeout value

**- How to verify it**
I manually changed ASIC and Gearbox FW followed by hard reset in order for FW upgrade to take place on init.

Signed-off-by: liora <liora@nvidia.com>
2021-02-11 12:53:36 +02:00
Arun Saravanan Balachandran
3015de1dd0
[sonic-host-service] Move to sonic-host-services package (#6273)
- Why I did it

To move ‘sonic-host-service’ which is currently built as a separate package to ‘sonic-host-services' package. 

- How I did it

- Moved 'sonic-host-server' to 'src/sonic-host-services' and included it as part of the python3 wheel.
- Other files were moved to 'src/sonic-host-services-data' and included as part of the deb package.
- Changed build option ‘INCLUDE_HOST_SERVICE’ to ‘ENABLE_HOST_SERVICE_ON_START’ for enabling sonic-hostservice at boot-up by default.
2021-02-08 19:35:08 -08:00
SuvarnaMeenakshi
62a599a5b3
[multi_asic][vs]: Add dependency in teamd service to start after topology service(#6594)
[multi_asic][vs]: Add dependency in teamd service to start after topology service.
- Why I did it
In multi-asic VS, topology service is run after database service to set up the internal asic topology.
swss and syncd have a dependency to start after topology service is run so that the interfaces are moved to right namespace and created in the right namespace. In case of multi-asic vs, during the initial boot up, when there is no configuration added, teamd service starts and swss/syncd do not start as topology service does not start. Upon loading configuration using config_db or minigraph, swss and sycnd start up , but teamd is not restarted as swss is not stopped and started. This causes teamd to be in a bad state and requires a reload of config.

- How I did it
Add dependency in teamd service to start after topology service is completed.

- How to verify it
No change in single asic vs or platform.
No change in multi-asic regular image.
Change only in multi-asic VS. Bring up a multi-asic VS image without any configration, teamd service will fail to start due to dependency failure. Load minigraph, start topology service, load configuration, ensure all services come up.
Signed-off-by: SuvarnaMeenakshi <sumeenak@microsoft.com>
2021-02-04 14:10:56 -08:00
Joe LeVeque
820d350301
[pcie-check] Update underlying pcieutil command and add to sudoers file (#6682)
- Why I did it

As of Azure/sonic-utilities#1297, subcommands of pcieutil have changed to remove the redundant pcie- prefix. This PR adapts calling applications (pcie-check) to the new syntax.

Resolves #6676

- How I did it

Remove pcie- prefix from pcieutil subcommands in calling applications
Also add pcieutil * to sudoers file, as pcieutil requires elevated permissions
2021-02-04 12:14:08 -08:00
Guohan Lu
3f2a39d583 [proc-exit-listener]: fix syntax error
the bug is introduced in commit 34cca20c

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2021-02-02 03:58:20 -08:00
Samuel Angebault
0c4d4ace76
[kdump] Fix OOM events in crashkernel (#6447)
A few issues where discovered with crashkernel on Arista platforms.

1) platforms using `docker_inram=on` would end up OOM in kdump environment.
This happens because the same initramfs is used by SONiC and the crashkernel.
With `docker_inram=on` the `dockerfs.tar.gz` is extracted in a `tmpfs` created for the occasion.
Since `dockerfs.tar.gz` weights more than 1.5G, it doesn't fit into the kdump environment and ends up OOM.
This OOM event can in turn trigger a panic.

2) Arista platforms with `secureboot` enabled would fail to load the crashkernel because the kernel parameter would be discarded on boot.
This happens because the `boot0` in secureboot mode is strict about kernel parameter injection.

3) The secureboot path allowlist would remove kernel crash reports.

4) The kdump service would fail on Arista products since `/boot/` is empty in `secureboot`

**- How I did it**

1) To prevent an OOM event in the crashkernel the fix is to avoid the codepaths in `union-mount` that create tmpfs and populate them. Some more codepath specific to Arista devices are also skipped to make the kdump process faster.
This relies on detecting that the initramfs is starting in a kdump environment and skipping some initialization.
The `/usr/sbin/kdump-config` tool appends a few kernel cmdline arguments when loading the crashkernel.
The most unique one is `systemd.unit=kdump-tools.service` which is used in a few initramfs hooks to set `in_kdump`.

2) To allow `kdump` to work in `secureboot` environment the cmdline generation in boot0 was slightly modified.
The codepath to load kernel parameters changed by SONiC is now running for booting in secure mode.
It was altered to prevent an append only behavior which would grow the `kernel-cmdline` at every reboot.
This ever growing behavior would lead `kexec` to fail to load the kernel due to a too long cmdline.

3) To get the kernel crash under /var/crash this path has to be added to `allowlist_paths`

4) The `/host/image-XXX/boot` folder is now populated in `secureboot` mode but not used.

**- How to verify it**

Regular boot:
 - enable kdump
 - enable docker_inram=on via kernel-params
 - reboot
 - generate a crash `echo c > /proc/sysrq-trigger`
 - before: witness OOM events on the console
 - after: crash kernel works and crash available under /var/crash

Secure boot:
 - enable kdump
 - reboot
 - generate a crash `echo c > /proc/sysrq-trigger`
 - before: witness no kdump
 - after: crash kernel works and crash available under /var/crash


Co-authored-by: Boyang Yu <byu@arista.com>
2021-02-02 01:55:09 -08:00
arlakshm
b5225407ef
[baseimage]: add docker ps to the sudoer file (#6604)
fixes Azure/sonic-utilities#1389

With the recent changes in sudoer files. The  show commands fails for the read-only users. 
The problem here is the 'docker ps' is failing in the function [get_routing_stack()](8a1109ed30/show/main.py (L54)) therefore all the CLI commands are failing.

Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2021-01-29 08:16:32 -08:00
arlakshm
ff8cc49b18
[multi asic] add ip netns identify command to sudoer (#6591)
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>

- Why I did it
The command sudo ip netns identify <pid> is used in function get_current_namespace
to check in the cli command is running in host context or within a namespace.

This function is used for every CLI command and command sudo ip netns identify <pid> needs to be added in sudoer files to allow users with RO access to run show cli commands

This problem is not there on single asic platforms.

- How I did it
Add ip netns identify [0-9]* to sudoers file.
2021-01-28 23:12:01 -08:00
Guohan Lu
34cca20cb6 [proc-exit-listener]: ignore blank lines
make proc-exit-listener more rebust

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2021-01-27 19:41:59 -08:00
abdosi
cfa8fbbf1a
[baseimage]: Updates for Ebtables and support for multi-asic (#6542)
Following changes were done for ebtables:

- Support for Multi-asic platforms. Ebtable filters are installed in namespace for multi-asic and not host. On Single asic installed on  host.

- For Multi-asic platforms we don't want to install on host otherwise Namespace-to-Namespace communication does not happens since ARP Request are not forwarded.

- Updated to use text file to restore ebtables rules then the binary format. Rules are restore as part of Database docker init instead of rc.local

- Removed the ebtable service files for buster as not needed as filters are restored/installed as part of database docker init.
   All the binaries are pre-installed with ebtables* binary are same as ebatbles-legacy-* 

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2021-01-27 08:36:10 -08:00
judyjoseph
46b3bd5503
[teamd]: Increase wait timeout for teamd docker stop to clean Port channels. (#6537)
The Portchannels were not getting cleaned up as the cleanup activity was taking more than 10 secs which is default docker timeout after which a SIGKILL will be send.
Fixes #6199
To check if it works out for this issue in 201911 ? #6503

This issue is significantly seen in master branch compared to 201911 because the Portchannel cleanup takes more time in master. Test on a DUT with 8 Port Channels.

master

    admin@str-s6000-acs-8:~$ time sudo systemctl stop teamd
    real    0m15.599s
    user    0m0.061s
    sys     0m0.038s
Sonic 201911.v58

    admin@str-s6000-acs-8:~$ time sudo systemctl stop teamd
    real    0m5.541s
    user    0m0.020s
    sys     0m0.028s
2021-01-23 20:57:52 -08:00
arlakshm
0e12ca81c7
[Multi Asic] support of swss.rec and sairedis.rec for multi asic (#6310)
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan arlakshm@microsoft.com

- Why I did it
This PR has the changes to support having different swss.rec and sairedis.rec for each asic.
The logrotate script is updated as well

- How I did it

Update the orchagent.sh script to use the logfile name options in these PRs(Azure/sonic-swss#1546 and Azure/sonic-sairedis#747)
In multi asic platforms the record files will be different for each asic, with the format swss.asic{x}.rec and sairedis.asic{x}.rec

Update the logrotate script for multiasic platform .
2021-01-22 09:42:19 -08:00
yozhao101
be3c036794
[supervisord] Monitoring the critical processes with supervisord. (#6242)
- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.

Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.

- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:

The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.

If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.

- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.

Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.

- Which release branch to backport (provide reason below if selected)

 201811
 201911
[x ] 202006
2021-01-21 12:57:49 -08:00
Qi Luo
25e4d773b9
[baseimage]: Cleanup sudoers file (#6518) 2021-01-21 08:28:32 -08:00
Ying Xie
054f5b7a53
[warm boot finalizer] only wait for enabled components to reconcile (#6454)
* [warm boot finalizer] only wait for enabled components to reconcile

Define the component with its associated service. Only wait for components that have associated service enabled to reconcile during warm reboot.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2021-01-15 07:48:11 -08:00
yozhao101
04cd1d61e8
[Monit] Monitoring the running status of containers. (#6251)
**- Why I did it**
This PR aims to monitor the running status of each container. Currently the auto-restart feature was enabled. If a critical process exited unexpected, the container will be restarted. If the container was restarted 3 times during 20 minutes, then it will not run anymore unless we cleared the flag using the command `sudo systemctl reset-failed <container_name>` manually. 

**- How I did it**
We will employ Monit to monitor a script. This script will generate the expected running container list and compare it with the current running containers. If there are containers which were expected to run but were not running, then an alerting message will be written into syslog.

**- How to verify it**
I tested this feature on a lab device `str-a7050-acs-3` which has single ASIC and `str2-n3164-acs-3` which has a Multi-ASIC. First I manually stopped a container by running the command `sudo systemctl stop <container_name>`, then I checked whether there was an alerting message in the syslog.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
2021-01-07 19:52:22 -08:00
Renuka Manavalan
dbc6718408
Take a copy of existing TACACS credentials and restore it during upgrade (#6285)
In scenario where upgrade gets config from minigraph, it could miss tacacs credentials as they are not in minigraph. Hence restore explicitly upon load-minigraph, if present.

- Why I did it
Upon boot, when config migration is required, the switch could load config from minigraph. The config-load from minigraph would wipe off TACACS key and disable login via TACACS, which would disable all remote user access. This change, would re-configure the TACACS if there is a saved copy available.

- How I did it
When config is loaded from minigraph, look for a TACACS credentials back up (tacacs.json) under /etc/sonic/old_config. If present, load the credentials into running config, before config-save is called.

- How to verify it
Remove /etc/sonic/config_db.json and do an image update. Upon reboot, w/o this change, you would not be able ssh in as remote user. You may login as admin and check out, "show tacacs" & "show aaa" to verify that tacacs-key is missing and login is not enabled for tacacs.
With this change applied, remove /etc/sonic/config_db.json, but save tacacs & aaa credentials as tacacs.json in /etc/sonic/. Upon reboot, you should see remote user access possible.
2021-01-07 16:45:38 -08:00
Joe LeVeque
e52581e919
[PDDF] Build and install Python 3 package (#6286)
- Make PDDF code compliant with both Python 2 and Python 3
- Align code with PEP8 standards using autopep8
- Build and install both Python 2 and Python 3 PDDF packages
2021-01-07 10:03:29 -08:00
Akhilesh Samineni
62e7c452d0
After first bootup, the FEATURE table is not present in CONFIG_DB (#5911)
Fix the After first bootup(onie-install), the FEATURE table is not present in CONFIG_DB. 
Fix is done by calling config reload.
2021-01-05 09:22:16 -08:00
Joe LeVeque
566ea4f601
[system-health] Convert to Python 3 (#5886)
- Convert system-health scripts to Python 3
- Build and install system-health as a Python 3 wheel
- Also convert newlines from DOS to UNIX
2020-12-29 14:04:09 -08:00
Joe LeVeque
62662acbd5
No longer install some unnecessary Python 2 packages in host (#6301)
- No longer install Python 2 packages in host:
    - libpython2.7-dev
    - docker
    - ipaddress
    - netifaces
    - azure-storage
    - watchdog
    - futures

- Install Python 3 versions of the following packages in host:
    - docker
    - azure-storage
    - watchdog
    - redis
    - swsssdk (install unconditionally)
2020-12-29 13:02:11 -08:00
lguohan
162f0fdfe1
[init_cfg]: allow enable/disable swss/teamd/syncd services (#6291)
swss/teamd/syncd services were changed to always enabled
in commit fad481edc1 as a workaround
for not letting hostcfgd start service during the bootup process.

commit 317a4b3410 introduce
wait till full system bootup before updating feature states in hostcfgd.

Thus, workaround introduced in commit fad481ed can be removed

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2020-12-28 10:33:46 -08:00
Samuel Angebault
163ed6acff
[Arista] Better handle arbitrary tmpfs in boot0 (#6274)
To limit IO and space usage on the flash device the boot0 script makes sure the SWI is in memory.
Because SONiC maps /tmp on the flash, some logic is required to make sure of it.
However it is possible for some provisioning mechanism to already download the swi in a memory file system.
This case was not handled properly by the boot0 script.
It now detect if the image is on a tmpfs or a ramfs and keep it there if that is the case.

The cleanup method has been updated accordingly and will only cleanup
the mount path if it's below /tmp/ as to not affect user mounted paths.

- How I did it

Check the filesystem on which the SWI pointed by swipath lies.
If this filesystem is a ramfs or a tmpfs the move_swi_to_tmpfs becomes a no-op.
Made sure the cleanup logic would not behave unexpectedly.

- How to verify it

In SONiC:

Download the swi under /tmp and makes sure it gets moved to /tmp/tmp-swi which gets mounted for that purpose.
Make sure /tmp/tmp-swi gets unmounted once the install process is done.

Create a new mountpoint under /ram using either ramfs or tmpfs and download the swi there.
Install the swi using sonic-installer and makes sure the image doesn't get moved by looking at the logs.
2020-12-23 22:38:59 -08:00
Prabhu Sreenivasan
df13245b9f
[CRM] Add support for snat, dnat and ipmc crm resources (#6012)
Signed-off-by: Prabhu Sreenivasan prabhu.sreenivasan@broadcom

What I did
Added support for snat, dnat and ipmc resources under CRM module.

How I did it
New feature NAT adds new resources snat_enty and dnat_entry that needs to be monitored. ipmc_entry tracks IP multicast resources used by switch.

How to verify it
sonic-utilities tests and crm spytest
2020-12-23 06:15:53 -08:00
lguohan
aa1cc848e2
[sonic-yang-mgmt-py2]: remove sonic-yang-mgmt py2 (#6262)
No longer needed as sonic-utilties has been moved python3

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2020-12-22 21:05:33 -08:00
Renuka Manavalan
ba02209141
First cut image update for kubernetes support. (#5421)
* First cut image update for kubernetes support.
With this,
    1)  dockers dhcp_relay, lldp, pmon, radv, snmp, telemetry are enabled
        for kube management
        init_cfg.json configure set_owner as kube for these

    2)  Each docker's start.sh updated to call container_startup.py to register going up
          As part of this call, it registers the current owner as local/kube and its version
          The images are built with its version ingrained into image during build

    3)  Update all docker's bash script to call 'container start/stop/wait' instead of 'docker start/stop/wait'.
         For all locally managed containers, it calls docker commands, hence no change for locally managed.
        
    4)  Introduced a new ctrmgrd service, that helps with transition between owners as  kube & local and carry over any labels update from STATE-DB to API server

    5)  hostcfgd updated to handle owner change

    6) Reboot scripts are updatd to tag kube running images as local, so upon reboot they run the same image.

   7) Added kube_commands.py to handle all updates with Kubernetes API serrver -- dedicated for k8s interaction only.
2020-12-22 08:01:33 -08:00