Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
This PR aims to monitor the memory usage of streaming telemetry container and restart streaming telemetry container if memory usage is larger than the pre-defined threshold.
How I did it
I borrowed the system tool Monit to run a script memory_checker which will periodically check the memory usage of streaming telemetry container. If the memory usage of telemetry container is larger than the pre-defined threshold for 10 times during 20 cycles, then an alerting message will be written into syslog and at the same time Monit will run the script restart_service to restart the streaming telemetry container.
How to verify it
I verified this implementation on device str-7260cx3-acs-1.
- Why I did it
To give SONiC Application Extension developers an environment to run and develop their apps.
- How I did it
Created sonic-sdk and sonic-sdk-buildenv dockers and their dbg versions.
- How to verify it
Build:
$ make -f slave target/sonic-sdk.gz target/sonic-sdk-buildenv.gz
#### Why I did it
If a process limits using nodes by mempolicy/cpusets, and those nodes become memory exhaustion status, one process may be killed by oom-killer.
No panic occurs in this case, because other node's memory may be free.
This means system total status may be not fatal yet.
#### How I did it
Remove 'vm.panic_on_oom=1' kernel flag from 'vmcore-sysctl.conf '
Why I did it
Currently, there is a bug in the ntp.conf jinja2 template where it will ignore the src_intf directive in CONFIG_DB if there are multiple IP addresses associated with an interface. This code change fixes that bug and allows the template to select the correct source interface for NTP.
How I did it
I did this by modifying the macro in ntp.conf.j2 which determines if there is an ip address associated with an interface to set a state variable when it detects a valid interface entry in CONFIG_DB instead of outputting "true" directly (which could result in multiple "trues" outputted for interfaces with multiple valid IP addresses).
How to verify it
Add two ipv4 addresses to an interface in SONiC
Add the following configuration to config_db.json
{
"NTP": {
"global": {
"src_intf": "Ethernet1"
}
}
}
Replace Ethernet1 with the interface name of the one you assigned the IP addresses to.
Run sudo config reload -y
Open /etc/ntp.conf and verify that the following line exists
...
interface listen Ethernet1
...
The interface specified should be the one set in the previous steps.
Description for the changelog
[ntp] Fix ntp.conf template to allow setting of source port in CONFIG_DB
Map priority 0 to TC 1 and priority 1 to TC 0
Send traffic on priority 0 and 1 and verified that it gets mapped correctly in hw
Signed-off-by: Neetha John <nejo@microsoft.com>
Why I did it
start pcie-check.service after config-setup.service since pcie_util depends on device_info which is available with config db metadata.
How I did it
Add config-setup.service as a dependency of pcie-check.service
How to verify it
Upon reboot, check if the pcie-check.sh throws the platform api error which is dependent on DEVICE_METADATA
Why I did it
Finding running containers through "docker ps" breaks when kubernetes deploys container, as the names are mangled.
How I did it
The data is is available from FEATURE table, which takes care of kubernetes deployment too.
How to verify it
Deploy a feature via kubernetes and don't expect error from container_check.
Signed-off-by: Stepan Blyschak stepanb@nvidia.com
This PR is part of SONiC Application Extension
Depends on #5938
- Why I did it
To provide an infrastructure change in order to support SONiC Application Extension feature.
- How I did it
Label every installable SONiC Docker with a minimal required manifest and auto-generate packages.json file based on
installed SONiC images.
- How to verify it
Build an image, execute the following command:
admin@sonic:~$ docker inspect docker-snmp:1.0.0 | jq '.[0].Config.Labels["com.azure.sonic.manifest"]' -r | jq
Cat /var/lib/sonic-package-manager/packages.json file to verify all dockers are listed there.
Why I did it
Support readonly version of the command vtysh
How I did it
Check if the command starting with "show", and verify only contains single command in script.
Fix#7364
99-default.link - was always in SONiC, but previous systemd (<247) had an issue and it did not work due to issue systemd/systemd#3374. Now systemd 247 works.
However, such policy overrides teamd provided mac address which causes teamd netdev to use a random mac
address. Therefore, needs to be disabled.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
#### Why I did it
To build flashrom properly with dependency tracking.
#### How I did it
Moved flashrom code from platform/broadcom/sonic-platform-modules-dell/tools directory to src/flashrom directory.
At the end, flashrom_0.9.7_amd64.deb package is build which will be installed in the devices.
Why I did it
Recent systemd upgrade from #7228 requires an extra cmdline parameter for dockerd to start properly.
Updating boot0 was missed as part of the systemd upgrade change.
How I did it
Just added the missing cmdline parameter in files/Aboot/boot0.j2
This change fixes#7372
How to verify it
Boot the image and dockerd should start normally.
Encounter error during "config-setup boot" if the updategraph is enabled.
How I did it
Correct the code inside the config-setup script.
Remove the space between the assignment operator.
How to verify it
Remove the /etc/sonic/config_db.json and reboot the device.
Originally, it will return following error after boot up.
rv: command not found
After modification, it can correctly parse the status of updategraph without error.
- Support compile sonic arm image on arm server. If arm image compiling is executed on arm server instead of using qemu mode on x86 server, compile time can be saved significantly.
- Add kernel argument systemd.unified_cgroup_hierarchy=0 for upgrade systemd to version 247, according to #7228
- rename multiarch docker to sonic-slave-${distro}-march-${arch}
Co-authored-by: Xianghong Gu <xgu@centecnetworks.com>
Co-authored-by: Shi Lei <shil@centecnetworks.com>
This commit has following changes:
* Add templates and code to support VoQ chassis iBGP peers
* Add support to convert a new VoQChassisInternal element in the
BGPSession element of the minigraph to a new BGP_VOQ_CHASSIS_NEIGHBOR
table in CONFIG_DB.
* Add a new set of "voq_chassis" templates to docker-fpm-frr
* Add a new BGP peer manager to bgpcfgd to add neighbors from the
BGP_VOQ_CHASSIS_NEIGHBOR table using the voq_chassis templates.
* Add a test case for minigraph.py, making sure the VoQChassisInternal
element creates a BGP_VOQ_CHASSIS_NEIGHBOR entry, but not if its
value is "false".
* Add a set of test cases for the new voq_chassis templates in
sonic-bgpcfgd tests.
Note that the templates expect the new
"bgp bestpath peer-type multipath-relax" bgpd configuration to be
available.
Signed-off-by: Joanne Mikkelson <jmmikkel@arista.com>
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
Since we introduced a new value always_disabled for the state field in FEATURE table, the expected running container list
should exclude the always_diabled containers. This bug was found by nightly test and posted at here: issue. This PR fixes#7210.
How I did it
I added a logic condition to decide whether the value of state field of a container was always_disabled or not.
How to verify it
I verified this on the device str-dx010-acs-1.
Which release branch to backport (provide reason below if selected)
201811
201911
202006
[ x] 202012
Signed-off-by: vedganes <vedavinayagam.ganesan@nokia.com>
Changes for setting platfrom specific lag id boundary id in the chassis
app db. The platfrom specific lag id boundaries are supplied via
chassisdb.conf. The lag_id_start and lag_id_end boundary values sourced
from this file are set in chassis app db which will be used by lag id
allocator to allocate unique lag id in atomic fashion
- Why I did it
I made the docker_img_ctl.j2 applicable for more dockers (including application extensions dockers) by adding an option not to mount tmpfs on /tmp/ and /var/tmp/. In some applications /tmp/ is a different docker volume which can't be tmpfs.
Also, I added and ability to pass REPO[:TAG]|[@digest]/IMAGE_ID instead of just REPO name.
- How I did it
Modified docker_img_ctl.j2 and docker makefiles.
- How to verify it
Run it on the switch.
- Why I did it
To allow SONiC Package Migration during SONiC-2-SONiC upgrade we need to start docker daemon in chroot-ed environment in new SONiC filesystem.
Later this script will be used to start dockerd in chroot environment on SONiC
- How I did it
Install a docker service script into /usr/lib/docker/ in SONiC filesystem.
- How to verify it
Install SONiC image on the switch, mount squashfs to some directory, mount overlay rw layer over squashfs, mount procfs and sysfs, mount docker library. Start the docker using:
root@sonic:~$ /usr/lib/docker/docker.sh start
Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
Features may be enabled/disabled for the same topology based on run-time
configuration. This PR adds the ability to enable/disable feature based
on config db data.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Why I did it
We skip install of CNI plugin, as we don't need. But this leaves node in "not ready" state, upon joining master.
To fix, we copy this dummy .conf file in /etc/cni/net.d
How I did it
Keep this file in /usr/share/sonic/templates and copy to /etc/cni/net.d upon joining k8s master.
How to verify it
Upon configuring master-IP and enable join, watch node join and move to ready state.
You may verify using kubectl get nodes command
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
In the configuration of rsyslog, duplicate messages will be suppressed and reported in the format of message repeated n times.
Due to this behavior, if a critical process in a container exited unexpectedly, the alerting message will be written into syslog once
and not be written into syslog anymore until the second critical process exited. This PR aims to differentiate these alerting messages such that they will not be suppressed by rsyslogd and can appear in the syslog periodically.
How I did it
This PR adds a counter into the alerting message and shows how many minutes a critical process was not running.
How to verify it
I verified and test this implementation on a physical DUT.
SONiC Package Manager will require to auto-generate the start script using that template. For that, we need this template to be recorded in SONiC filesystem.
Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
- Why I did it
Group all SONiC services together and able to manage them together. Will be used in config reload command as much simpler and generic way to restart services.
- How I did it
Add services to sonic.target
- How to verify it
Together with Azure/sonic-utilities#1199
config reload -y
Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
show ip interfaces is enhanced recently to support multi ASIC platforms in this PR- https://github.com/Azure/sonic-utilities/pull/1396 .
The ipintutil script as to run as sudo user, to get the ip interface from each namespace.
Add this script to the sudoer file so that show ip interface command is available for user with read-only permissions
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
- Why I did it
The pcie configuration file location is under plugin directory not under platform directory.
#6437
- How I did it
Move all pcie.yaml configuration file from plugin to platform directory.
Remove unnecessary timer to start pcie-check.service
Move pcie-check.service to sonic-host-services
- How to verify it
Verify on the device
- Add support for `DCS-7050SX3-48YC8` and `DCS-7050SX3-48C8` platform
- Add support for more variants of `DCS-7280CR3-32[PD]4`
- Add Supervisor to Linecard consutil support
- Complete Watchdog platform API support
- Fix some PSU behavior on `DCS-7050QX-32` and `DCS-7060CX-32S`
- Fix SEU management on `DCS-7060CX-32S`
- Allow kernel modules to build up to linux 5.10
- Rename led color `orange` to `amber`
- Miscellaneous fixes
Update topology script to retrieve hwsku from minigraph
if hwsku information is not available in config_db.
Fix clean up of interfaces in msft_multi_asic_vs hwsku
topology script.
- Why I did it
When bringing up multi-asic VS switch, topology service is started during boot up.
Topology service starts a shell script which runs the topology script present in /usr/share/sonic/device// directory. To invoke hwsku specific script, the topology script tries to retrieve hwsku information from config_db.
During initial boot up config_db might not be populated. In order to start topology service before config_db is updated,
update topology script to get hwsku information from minigraph.xml if it is available.
This will be helpful to bring up multi-asic VS testbed by loading minigraph and starting topology service.
- How I did it
Update topology.sh script to retrieve hwsku information from minigraph.xml.
Fix clean up function on msft_multi_asic_vs toplogy script.
- How to verify it
single-asic VS - no change; topology service is only enabled for multi-asic VS.
multi-asic VS - Bring up multi-asic VS image, copy minigraph to vs image, start topology service. Topology service should be successful.
to test clean up function fix, start topology service - make sure interfaces are created and moved to the right namespaces.
stop topology service - make sure namespace do not have any interface and all front end interfaces are present in default namespace.
- What I did
All SWSS dependent services should stop before SWSS service to avoid future possible issues.
For example 'teamd' service will stop before to allow the driver unload netdev gracefully.
This is to stop all LAG's before restarting syncd service when running 'config reload' command.
- How I did it
Change the order of dependent services of SWSS.
- How to verify it
Run 'config reload' command.
Previously the operation failed when a large number of PortChannel configured on the system.
Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
* Add *MUX_CABLE_TABLE* to set of tables to clear on SWSS start, which
will clear HW_MUX_CABLE_TABLE and MUX_CABLE_TABLE
* Order swss to start before pmon to ensure that DBs are cleared before
xcvrd (running inside pmon) starts and re-populates the tables
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Fix marvell-armhf build break
The azure-storage package depends on the cryptography package. Newer
versions of cryptography require the rust compiler, the correct version
for which is not readily available in buster. Hence we pre-install an
older version here to satisfy the azure-storage dependency.
Note: This is not a problem for other architectures as pre-built versions
of cryptography are available for those. This sequence can be removed
after upgrading to debian bullseye.
**- Why I did it**
To support FW upgrade on init.
**- How I did it**
Change timeout value
**- How to verify it**
I manually changed ASIC and Gearbox FW followed by hard reset in order for FW upgrade to take place on init.
Signed-off-by: liora <liora@nvidia.com>
- Why I did it
To move ‘sonic-host-service’ which is currently built as a separate package to ‘sonic-host-services' package.
- How I did it
- Moved 'sonic-host-server' to 'src/sonic-host-services' and included it as part of the python3 wheel.
- Other files were moved to 'src/sonic-host-services-data' and included as part of the deb package.
- Changed build option ‘INCLUDE_HOST_SERVICE’ to ‘ENABLE_HOST_SERVICE_ON_START’ for enabling sonic-hostservice at boot-up by default.
[multi_asic][vs]: Add dependency in teamd service to start after topology service.
- Why I did it
In multi-asic VS, topology service is run after database service to set up the internal asic topology.
swss and syncd have a dependency to start after topology service is run so that the interfaces are moved to right namespace and created in the right namespace. In case of multi-asic vs, during the initial boot up, when there is no configuration added, teamd service starts and swss/syncd do not start as topology service does not start. Upon loading configuration using config_db or minigraph, swss and sycnd start up , but teamd is not restarted as swss is not stopped and started. This causes teamd to be in a bad state and requires a reload of config.
- How I did it
Add dependency in teamd service to start after topology service is completed.
- How to verify it
No change in single asic vs or platform.
No change in multi-asic regular image.
Change only in multi-asic VS. Bring up a multi-asic VS image without any configration, teamd service will fail to start due to dependency failure. Load minigraph, start topology service, load configuration, ensure all services come up.
Signed-off-by: SuvarnaMeenakshi <sumeenak@microsoft.com>
- Why I did it
As of Azure/sonic-utilities#1297, subcommands of pcieutil have changed to remove the redundant pcie- prefix. This PR adapts calling applications (pcie-check) to the new syntax.
Resolves#6676
- How I did it
Remove pcie- prefix from pcieutil subcommands in calling applications
Also add pcieutil * to sudoers file, as pcieutil requires elevated permissions
A few issues where discovered with crashkernel on Arista platforms.
1) platforms using `docker_inram=on` would end up OOM in kdump environment.
This happens because the same initramfs is used by SONiC and the crashkernel.
With `docker_inram=on` the `dockerfs.tar.gz` is extracted in a `tmpfs` created for the occasion.
Since `dockerfs.tar.gz` weights more than 1.5G, it doesn't fit into the kdump environment and ends up OOM.
This OOM event can in turn trigger a panic.
2) Arista platforms with `secureboot` enabled would fail to load the crashkernel because the kernel parameter would be discarded on boot.
This happens because the `boot0` in secureboot mode is strict about kernel parameter injection.
3) The secureboot path allowlist would remove kernel crash reports.
4) The kdump service would fail on Arista products since `/boot/` is empty in `secureboot`
**- How I did it**
1) To prevent an OOM event in the crashkernel the fix is to avoid the codepaths in `union-mount` that create tmpfs and populate them. Some more codepath specific to Arista devices are also skipped to make the kdump process faster.
This relies on detecting that the initramfs is starting in a kdump environment and skipping some initialization.
The `/usr/sbin/kdump-config` tool appends a few kernel cmdline arguments when loading the crashkernel.
The most unique one is `systemd.unit=kdump-tools.service` which is used in a few initramfs hooks to set `in_kdump`.
2) To allow `kdump` to work in `secureboot` environment the cmdline generation in boot0 was slightly modified.
The codepath to load kernel parameters changed by SONiC is now running for booting in secure mode.
It was altered to prevent an append only behavior which would grow the `kernel-cmdline` at every reboot.
This ever growing behavior would lead `kexec` to fail to load the kernel due to a too long cmdline.
3) To get the kernel crash under /var/crash this path has to be added to `allowlist_paths`
4) The `/host/image-XXX/boot` folder is now populated in `secureboot` mode but not used.
**- How to verify it**
Regular boot:
- enable kdump
- enable docker_inram=on via kernel-params
- reboot
- generate a crash `echo c > /proc/sysrq-trigger`
- before: witness OOM events on the console
- after: crash kernel works and crash available under /var/crash
Secure boot:
- enable kdump
- reboot
- generate a crash `echo c > /proc/sysrq-trigger`
- before: witness no kdump
- after: crash kernel works and crash available under /var/crash
Co-authored-by: Boyang Yu <byu@arista.com>
fixesAzure/sonic-utilities#1389
With the recent changes in sudoer files. The show commands fails for the read-only users.
The problem here is the 'docker ps' is failing in the function [get_routing_stack()](8a1109ed30/show/main.py (L54)) therefore all the CLI commands are failing.
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
- Why I did it
The command sudo ip netns identify <pid> is used in function get_current_namespace
to check in the cli command is running in host context or within a namespace.
This function is used for every CLI command and command sudo ip netns identify <pid> needs to be added in sudoer files to allow users with RO access to run show cli commands
This problem is not there on single asic platforms.
- How I did it
Add ip netns identify [0-9]* to sudoers file.