Commit Graph

1205 Commits

Author SHA1 Message Date
Stephen Sun
8f4a1b7b85
[Mellanox] Support Mellanox-SN4600C-C64 as T1 switch in dual-ToR scenario (#11261)
- Why I did it
Support Mellanox-SN4600C-C64 as T1 switch in dual-ToR scenario
This is to port #11032 and #11299 from 202012 to master.

Support additional queue and PG in buffer templates, including both traditional and dynamic model
Support mapping DSCP 2/6 to lossless traffic in the QoS template.
Add macros to generate additional lossless PG in the dynamic model
Adjust the order in which the generic/dedicated (with additional lossless queues) macros are checked and called to generate buffer tables in common template buffers_config.j2
Buffer tables are rendered via using macros.
Both generic and dedicated macros are defined on our platform. Currently, the generic one is called as long as it is defined, which causes the generic one always being called on our platform. To avoid it, the dedicated macrio is checked and called first and then the generic ones.
Support MAP_PFC_PRIORITY_TO_PRIORITY_GROUP on ports with additional lossless queues.
On Mellanox-SN4600C-C64, buffer configuration for t1 is calculated as:

40 * 100G downlink ports with 4 lossless PGs/queues, 1 lossy PG, and 3 lossy queues
16 * 100G uplink ports with 2 lossless PGs/queues, 1 lossy PG, and 5 lossy queues

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-07-20 09:48:15 +03:00
andywongarista
88d0ce5ce8
Add gbsyncd container for broncos (#11154)
* Add docker-gbsyncd-broncos support
* Address review comments
* Add socket to gbsyncd
* Upgrade gbsyncd-broncos to bullseye
2022-07-18 10:57:27 +08:00
bingwang-ms
52c36cd31f
Add flag to control the generation of PORT_QOS_MAP|global entry (#11448)
Why I did it
This PR is to add a flag to control whether to generate PORT_QOS_MAP|global entry or not.
It's because for some HWSKU, such as BackEndToRRouter and BackEndLeafRouter, there is no DSCP_TO_TC_MAP defined.
Hence, if the PORT_QOS_MAP|global entry is generated, OA will report some error because the DSCP_TO_TC_MAP map AZURE can not be found.

Jul 14 00:24:40.286767 str2-7050qx-32s-acs-03 ERR swss#orchagent: :- saiObjectTypeQuery: invalid object id oid:0x7fddb43605d0
Jul 14 00:24:40.286767 str2-7050qx-32s-acs-03 ERR swss#orchagent: :- meta_generic_validation_objlist: SAI_SWITCH_ATTR_QOS_DSCP_TO_TC_MAP:SAI_ATTR_VALUE_TYPE_OBJECT_ID object on list [0] oid 0x7fddb43605d0 is not valid, returned null object id
Jul 14 00:24:40.286767 str2-7050qx-32s-acs-03 ERR swss#orchagent: :- applyDscpToTcMapToSwitch: Failed to apply DSCP_TO_TC QoS map to switch rv:-5
Jul 14 00:24:40.286767 str2-7050qx-32s-acs-03 ERR swss#orchagent: :- doTask: Failed to process QOS task, drop it
This PR is to address the issue.

How I did it
Add a flag require_global_dscp_to_tc_map to control whether to generate the PORT_QOS_MAP|global entry. The default value for require_global_dscp_to_tc_map is true. If the device type is storage backend, the value is changed to false. Then the PORT_QOS_MAP|global entry is not generated.

How to verify it
Update the current test_qos_dscp_remapping_render_template to cover storage backend.
2022-07-14 10:19:00 -07:00
Alexander Allen
429254cb2d
[arm] Refactor installer and build to allow arm builds targeted at grub platforms (#11341)
Refactors the SONiC Installer to support greater flexibility in building for a given architecture and bootloader. 

#### Why I did it

Currently the SONiC installer assumes that if a platform is ARM based that it uses the `uboot` bootloader and uses the `grub` bootloader otherwise. This is not a correct assumption to make as ARM is not strictly tied to uboot and x86 is not strictly tied to grub. 

#### How I did it

To implement this I introduce the following changes:
* Remove the different arch folders from the `installer/` directory
* Merge the generic components of the ARM and x86 installer into `installer/installer.sh`
* Refactor x86 + grub specific functions into `installer/default_platform.conf` 
* Modify installer to call `default_platform.conf` file and also call `platform/[platform]/patform.conf` file as well to override as needed
* Update references to the installer in the `build_image.sh` script
* Add `TARGET_BOOTLOADER` variable that is by default `uboot` for ARM devices and `grub` for x86 unless overridden in `platform/[platform]/rules.mk`
* Update bootloader logic in `build_debian.sh` to be based on `TARGET_BOOTLOADER` instead of `TARGET_ARCH` and to reference the grub package in a generic manner

#### How to verify it

This has been tested on a ARM test platform as well as on Mellanox amd64 switches as well to ensure there was no impact. 

#### Description for the changelog
[arm] Refactor installer and build to allow arm builds targeted at grub platforms

#### Link to config_db schema for YANG module changes

N/A
2022-07-12 15:00:57 -07:00
geogchen
5171589f4d
Add support to generate /e/n/i when there are multiple MGMT_INTERFACE (#11368)
Why I did it
Currently interfaces.j2 hardcodes to eth0 even when there are multiple interfaces in MGMT_INTERFACE. This change adds support to generate /e/n/i when there are multiple interfaces in MGMT_INTERFACE.

How I did it
By removing hardcoded eth0 when looping through MGMT_INTERFACE.

How to verify it
Verified through unit test.

Which release branch to backport (provide reason below if selected)
 201811
 201911
 202006
 202012
 202106
 202111
 202205
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)
2022-07-12 14:55:15 -07:00
tjchadaga
4f95974669
Add load_minigraph option to include traffic-shift-away during config migration (#11403) 2022-07-12 10:08:58 -07:00
tjchadaga
849eb4bf32
Changes to persist TSA/B state across reloads (#11257) 2022-07-12 00:22:48 -07:00
Neetha John
7cd10019b8
Add backend acl template (#11220)
Why I did it
Storage backend has all vlan members tagged. If untagged packets are received on those links, they are accounted as RX_DROPS which can lead to false alarms in monitoring tools. Using this acl to hide these drops.

How I did it
Created a acl template which will be loaded during minigraph load for backend. This template will allow tagged vlan packets and dropped untagged

How to verify it
Unit tests

Signed-off-by: Neetha John <nejo@microsoft.com>
2022-07-06 10:24:16 -07:00
Stepan Blyshchak
ef8675d7ab
[sonic_debian_extension] install systemd-bootchart (#11047)
- Why I did it
Implemented sonic-net/SONiC#1001

- How I did it
Install systemd-bootchart tool and provide default config for it.

- How to verify it
Run build and verify systemd-bootchart is installed.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-07-06 14:03:31 +03:00
judyjoseph
baaf8b085f
Revert "Update include_macsec flag if type is SpineRouter (#11141)" (#11306)
This reverts commit c9f36957db.
2022-07-01 11:00:30 -07:00
Jing Zhang
5d03b5d0df
Avoid write_standby in warm restart context (#11283)
Avoid write_standby in warm restart context.

sign-off: Jing Zhang zhangjing@microsoft.com

Why I did it
In warm restart context, we should avoid mux state change.

How I did it
Check warm restart flag before applying changes to app db.

How to verify it
Ran write_standby in table missing, key missing, field missing scenarios.
Did a warm restart, app db changes were skipped. Saw this in syslog:
WARNING write_standby: Taking no action due to ongoing warmrestart.
2022-06-29 21:34:02 -07:00
andywongarista
6e0559d5fa
[Arista] Add initial support for 720DT-48S (#10656)
Added initial set of config files to allow for booting and partial traffic testing in SONiC on the 720DT-48S.

How to verify it
- Switch boots
- show interfaces status shows links up on interfaces Ethernet24-51
- Traffic flows with no errors on interfaces Ethernet24-51
2022-06-29 09:56:24 -07:00
davidpil2002
8b7d069880
Add support for Password Hardening (#10323)
- Why I did it
New security feature for enforcing strong passwords when login or changing passwords of existing users into the switch.

- How I did it
By using mainly Linux package named pam-cracklib that support the enforcement of user passwords, the daemon named hostcfgd, will support add/modify password policies that enforce and strengthen the user passwords.

- How to verify it
Manually Verification-
1. Enable the feature, using the new sonic-cli command passw-hardening or manually add the password hardening table like shown in HLD by using redis-cli command

2. Change password policies manually like in step 1.
Notes:
password hardening CLI can be found in sonic-utilities repo-
P.R: Add support for Password Hardening sonic-utilities#2121
code config path: config/plugins/sonic-passwh_yang.py
code show path: show/plugins/sonic-passwh_yang.py

3. Create a new user (using adduser command) or modify an existing password by using passwd command in the terminal. And it will now request a strong password instead of default linux policies.

Automatic Verification - Unitest:
This PR contained unitest that cover:
1. test default init values of the feature in PAM files
2. test all the types of classes policies supported by the feature in PAM files
3. test aging policy configuration in PAM files
2022-06-29 15:34:56 +03:00
bingwang-ms
ac86f71287
Add extra lossy PG profile for ports between T1 and T2 (#11157)
Signed-off-by: bingwang <wang.bing@microsoft.com>

Why I did it
This PR brings two changes

Add lossy PG profile for PG2 and PG6 on T1 for ports between T1 and T2.
After PR Update qos config to clear queues for bounced back traffic #10176 , the DSCP_TO_TC_MAP and TC_TO_PG_MAP is updated when remapping is enable

DSCP_TO_TC_MAP
Before	After	Why do this change
"2" : "1"	"2" : "2"	Only change for leaf router to map DSCP 2 to TC 2 as TC 2 will be used for lossless TC
"6" : "1"	"6" : "6"	Only change for leaf router to map DSCP 6 to TC 6 as TC 6 will be used for lossless TC

TC_TO_PRIORITY_GROUP_MAP
Before	After	Why do this change
"2" : "0"	"2" : "2"	Only change for leaf router to map TC 2 to PG 2 as PG 2 will be used for lossless PG
"6" : "0"	"6" : "6"	Only change for leaf router to map TC 6 to PG 6 as PG 6 will be used for lossless PG

So, we have two new lossy PGs (2 and 6) for the T2 facing ports on T1, and two new lossless PGs (2 and 6) for the T0 facing port on T1.
However, there is no lossy PG profile for the T2 facing ports on T1. The lossless PGs for ports between T1 and T0 have been handled by buffermgrd .Therefore, We need to add lossy PG profiles for T2 facing ports on T1.

We don't have this issue on T0 because PG 2 and PG 6 are lossless PGs, and there is no lossy traffic mapped to PG 2 and PG 6

Map port level TC7 to PG0
Before the PCBB change, DSCP48 -> TC 6 -> PG 0.
After the PCBB change, DSCP48 -> TC 7 -> PG 7
Actually, we can map TC7 to PG0 to save a lossy PG.

How I did it
Update the qos and buffer template.

How to verify it
Verified by UT.
2022-06-28 12:50:33 -07:00
judyjoseph
c9f36957db
Update include_macsec flag if type is SpineRouter (#11141)
Add the support to enable macsec when type is SpineRouter
2022-06-24 10:32:02 -07:00
geogchen
6a9c058a92
Revert "Add support for generating interface configuration in /etc/network/interfaces for multiple management interfaces (#11204)" (#11241)
This reverts commit 90a849ea85.

#### Why I did it
The interfaces unit test did not cover some of the conditions in interfaces.j2 that was changed in #11204.  Therefore reverting the change and add the tests before making the change to interfaces.j2.

#### How I did it
Git revert.

#### How to verify it

#### Which release branch to backport (provide reason below if selected)


- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205

#### Description for the changelog

#### Link to config_db schema for YANG module changes

#### A picture of a cute animal (not mandatory but encouraged)
2022-06-24 06:30:33 -07:00
Sudharsan Dhamal Gopalarathnam
9452095e25
[lldp]Fix lldp spawned after reboot when disabled (#11080)
- Why I did it
When LLDP is disabled through feature command, it gets spawned after reboot.

- How I did it
In syncd.sh check if the service is enabled before spawning automatically during cold reboot.

- How to verify it
Disable lldp feature. Perform cold reboot and verify its not spawned.
2022-06-22 03:11:41 +03:00
geogchen
90a849ea85
Add support for generating interface configuration in /etc/network/interfaces for multiple management interfaces (#11204)
* [Interfaces] Modify template to support multiple management interfaces

* Modify minigraph to process interfaces in sorted order

Signed-off-by: Ubuntu <gechen@gechen-sonic-dev.d0r25nej54guppclip4gpy5b5a.jx.internal.cloudapp.net>

* Add UT minigraph

Signed-off-by: Ubuntu <gechen@gechen-sonic-dev.d0r25nej54guppclip4gpy5b5a.jx.internal.cloudapp.net>

* make case insensitve comparison

Signed-off-by: George Chen <gechen@microsoft.com>

* Use natural sort

Signed-off-by: George Chen <gechen@microsoft.com>

Co-authored-by: Ubuntu <gechen@gechen-sonic-dev.d0r25nej54guppclip4gpy5b5a.jx.internal.cloudapp.net>
2022-06-21 10:16:10 -07:00
xumia
fdef1f0342
[Build]: Support to use symbol links for lazy installation targets to reduce the image size (#10923)
Why I did it
Support to use symbol links in platform folder to reduce the image size.
The current solution is to copy each lazy installation targets (xxx.deb files) to each of the folders in the platform folder. The size will keep growing when more and more packages added in the platform folder. For cisco-8000 as an example, the size will be up to 2G, while most of them are duplicate packages in the platform folder.

How I did it
Create a new folder in platform/common, all the deb packages are copied to the folder, any other folders where use the packages are the symbol links to the common folder.

Why platform.tar?
We have implemented a patch for it, see #10775, but the problem is the the onie use really old unzip version, cannot support the symbol links.
The current solution is similar to the PR 10775, but make the platform folder into a tar package, which can be supported by onie. During the installation, the package.tar will be extracted to the original folder and removed.
2022-06-21 13:03:55 +08:00
jingwenxie
fdc65d7600
Remove minigraph loading in updategraph script (#11146)
Why I did it
Minigraph will be deprecated in the future. So minigraph related reload should be deleted.

How I did it
Remove unused load_minigraph
2022-06-21 08:57:57 +08:00
Stepan Blyshchak
42576d2664
[auto-ts] add memory check (#10433)
#### Why I did it

To support automatic techsupport invokation in case memory usage is too high.

#### How I did it

Implemented according to https://github.com/Azure/SONiC/pull/939

#### How to verify it

UT, manual test on the switch.

*DEPENDS* on https://github.com/Azure/sonic-utilities/pull/2116
2022-06-20 09:39:05 -07:00
yozhao101
241f4454b4
[memory_checker] Do not check memory usage of containers which are not created (#11129)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
This PR aims to fix an issue (#10088) by enhancing the script memory_checker.

Specifically, if container is not created successfully during device is booted/rebooted, then memory_checker do not need check its memory usage.

How I did it
In the script memory_checker, a function is added to get names of running containers. If the specified container name is not in current running container list, then this script will exit without checking its memory usage.

How to verify it
I tested on a lab device by following the steps:

Stops telemetry container with command sudo systemctl stop telemetry.service

Removes telemetry container with command docker rm telemetry

Checks whether the script memory_checker ran by Monit will generate the syslog message saying it will exit without checking memory usage of telemetry.
2022-06-17 12:13:18 -07:00
bingwang-ms
83f23e26ff
Generate switch level dscp_to_tc_map entry from qos_config template (#11087)
* Generate switch level dscp_to_tc_map

Signed-off-by: bingwang <wang.bing@microsoft.com>
2022-06-17 08:44:30 +08:00
byu343
89020f53e4
[Arista] Add support support for 7060dx5_64s and 7060px5_64s (#10888)
Why I did it
This change adds the support for Arista 7060dx5_64s and 7060px5_64s

How I did it
How to verify it
We verified the platform driver is working and the ports are up on 7060dx5_64s and 7060px5_64s.
2022-06-16 09:51:42 -07:00
Samuel Angebault
30bfed92fd
[Arista] Add configuration files for 7050X4-32S platform (#10799)
Add most configuration files for the DCS-7050PX4-32S and DCS-7050DX4-32S.
This review only contains platform configuration files, dataplane ones will follow in future change.

Co-authored-by: Zhi Yuan (Carl) Zhao <zyzhao@arista.com>
2022-06-16 09:42:10 -07:00
shlomibitton
1474ad76d8
[Mellanox] [pmon] Fix for PMON service not starting when restarting SWSS service after fast/warm reboot (#10901)
- Why I did it
Recent change to delay PMON service in case of fast/warm reboot introduce an issue when restarting only SWSS service after fast/warm reboot for Nvidia platform.
Since the timer is triggered only when the system boot, in a scenario when the system is after a fast/warm reboot and the user restart SWSS service, as part of syncd.sh script, PMON service will stop but the timer will not start again.

- How I did it
On syncd.sh script, in case of fast/warm indication, check if pmon.timer is running.
If it is running it means we are at the first boot and continue normally.
If it is not running, meaning the service was restarted, start the timer to keep the system behavior consistent.

- How to verify it
Run fast/warm reboot.
service swss restart.
Observe PMON service starting.

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2022-06-16 12:15:09 +03:00
jingwenxie
cca3b5be5b
Reduce logic in updategraph (#11010)
Why I did it
The dhcp_graph_url used by internal service is always set as "N/A". So we can make the updategraph logic short.

How I did it
Shorten 'if statement' logic for /tmp/dhcp_graph_url
2022-06-14 22:18:47 +08:00
judyjoseph
0b1ae9c43c
Cleanup macsec stateDB tables on restart (#11066)
Clean macsec tables in STATE_DB on start
2022-06-09 15:32:24 -07:00
bingwang-ms
1cc602c6af
Add two extra lossless queues for bounced back traffic (#10496)
Signed-off-by: bingwang <bingwang@microsoft.com>

Why I did it
This PR is to add two extra lossless queues for bounced back traffic.
HLD sonic-net/SONiC#950

SKUs include
Arista-7050CX3-32S-C32
Arista-7050CX3-32S-D48C8
Arista-7260CX3-D108C8
Arista-7260CX3-C64
Arista-7260CX3-Q64

How I did it
Update the buffers.json.j2 template and buffers_config.j2 template to generate new BUFFER_QUEUE table.

For T1 devices, queue 2 and queue 6 are set as lossless queues on T0 facing ports.
For T0 devices, queue 2 and queue 6 are set as lossless queues on T1 facing ports.
Queue 7 is added as a new lossy queue as DSCP 48 is mapped to TC 7, and then mapped into Queue 7

How to verify it
Verified by UT
Verified by coping the new template and generate buffer config with sonic-cfggen
2022-06-02 13:03:27 -07:00
bingwang-ms
0c9bbee735
Update qos template to support SYSTEM_DEFAULT table (#10936)
* Update qos template to support SYSTEM_DEFAULT table

Signed-off-by: bingwang <wang.bing@microsoft.com>
2022-06-02 21:48:57 +08:00
xumia
0552d6b172
Support symcrypt fips config for aboot/uboot (#10729)
Why I did it
Support symcrypt fips config for aboot/uboot
2022-06-02 15:35:17 +08:00
Hua Liu
96954f0134
[swsscommon] Add c++ version sonic-db-cli from sonic-swss-common (#10825)
#### Why I did it
    Fix sonic-db-cli high CPU usage on SONiC startup issue: https://github.com/Azure/sonic-buildimage/issues/10218
    ETA of this issue will be 2022/05/31

#### How I did it
    Re-write sonic-cli with c++ in sonic-swss-common: https://github.com/Azure/sonic-swss-common/pull/607
    Modify swss-common rules and slave.mk to install c++ version sonic-db-cli.
    

#### How to verify it
    Pass all E2E test scenario.

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111

#### Description for the changelog
    Build and install c++ version sonic-db-cli from swss-common.

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/SONiC/wiki/Configuration.
-->

#### A picture of a cute animal (not mandatory but encouraged)
2022-06-01 08:05:53 +08:00
Lukas Stockner
c9b27cde71
[swss] Clear VXLAN tunnel table from State DB on startup (#10822)
* When reloading config after crashes, VTEP interfaces are sometimes not created since the tunnel still exists in the STATE_DB.
* Adding VXLAN_TUNNEL_TABLE to the list of tables to be cleaned in swss.sh fixes the problem.
2022-05-31 08:54:31 -07:00
davidpil2002
ab0930313b
[YANG] Add support for Password Hardening (#10322)
- Why I did it
Yang Model about password hardening feature, the sonic CLI of this feature was autogenerated from this Yang model

- How I did it
Create new Yang model in src/sonic-yang-models/yang-models/sonic-passwh.yang.

- How to verify it
There are unitests(yang test) in this P.R covering all the passwords policies with good and bad values cases.
Or is possible manually using the config/show password commands that were autogenerated from this Yang model. (this CLI code added in sonic-utilities)
2022-05-29 13:54:51 +03:00
xumia
f0dfd398a6
Revert "Reduce image size for lazy installation packages (#10775)" (#10916)
This reverts commit 15cf9b0d70.
Why I did it
Revert the PR #10775, for it has impact on onie installation.
It is caused by the symbol links not supported in some of the onie unzip.
We will enable after fixing the issue, see #10914
2022-05-26 09:39:48 +08:00
abdosi
0285bfe42e
[chassis] Fix issues regarding database service failure handling and mid-plane connectivity for namespace. (#10500)
What/Why I did:

Issue1: By setting up of ipvlan interface in interface-config.sh we are not tolerant to failures. Reason being interface-config.service is one-shot and do not have restart capability. 

Scenario: For example if let's say database service goes in fail state  then interface-services also gets failed because of dependency check but later database service gets restart but interface service will remain in stuck state and the ipvlan interface nevers get created.

Solution: Moved all the logic in database service from interface-config service which looks more align logically also since the namespace is created here and all the network setting (sysctl) are happening here.With this if database starts we recreate the interface.

Issue 2: Use of IPVLAN vs MACVLAN

Currently we are using ipvlan mode.  However above failure scenario is not handle correctly by ipvlan mode. Once the ipvlan interface is created and ip address assign to it and if we restart interface-config or database (new PR) service Linux Kernel gives error "Error: Address already assigned to an ipvlan device."  based on this:https://github.com/torvalds/linux/blob/master/drivers/net/ipvlan/ipvlan_main.c#L978Reason being if we do not do cleanup of ip address assignment (need to be unique for IPVLAN)  it remains in Kernel Database and never goes to free pool even though namespace is deleted. 

Solution: Considering this hard dependency of unique ip macvlan mode is better for us and since everything is managed by Linux Kernel and no dependency for on user configured IP address.

Issue3: Namespace database Service do not check reachability to Supervisor Redis Chassis   Server.

Currently there is no explicit check as we never do Redis PING from namespace to Supervisor Redis Chassis  Server. With this check it's possible we will start database and all other docker even though there is no connectivity and will hit the error/failure late in cycle

Solution: Added explicit PING from namespace that will check this reachability.

Issue 4:flushdb give exception when trying to accces Chassis Server DB over Unix Sokcet.

Solution: Handle gracefully via try..except and log the message.
2022-05-24 16:54:12 -07:00
Maxime Lorrillere
392899682f
[Arista] Add support for Wolverine linecards (#8887)
Add support for WolverineQCpu, WolverineQCpuMs, WolverineQCpuBk, WolverineQCpuBkMs

Co-authored-by: Maxime Lorrillere <mlorrillere@arista.com>
2022-05-20 14:11:06 -07:00
Senthil Kumar Guruswamy
f37dd770cd
System Ready (#10479)
Why I did it
At present, there is no mechanism in an event driven model to know that the system is up with all the essential sonic services and also, all the docker apps are ready along with port ready status to start the network traffic. With the asynchronous architecture of SONiC, we will not be able to verify if the config has been applied all the way down to the HW. But we can get the closest up status of each app and arrive at the system readiness.

How I did it
A new python based system monitor tool is introduced under system-health framework to monitor all the essential system host services including docker wrapper services on an event based model and declare the system is ready. This framework gives provision for docker apps to notify its closest up status. CLIs are provided to fetch the current system status and also service running status and its app ready status along with failure reason if any.

How to verify it
"show system-health sysready-status" click CLI
Syslogs for system ready
2022-05-20 13:25:11 -07:00
Arun Saravanan Balachandran
f4b22f67a4
[initramfs]: SSD firmware upgrade in initramfs (#10748)
Why I did it
To upgrade SSD firmware in initramfs while rebooting from SONiC to SONiC and during NOS to SONiC migration.

How I did it
New option 'ssd-upgrader-part’ is introduced in grub command line, to indicate the partition and its filesystem type in which the SSD firmware updater is present. ‘ssd-upgrader-part’ syntax is ssd-upgrader-part=<partition>,<filesystem type>. Example: ssd-upgrader-part=/dev/sda8,ext4

A new initramfs script ‘ssd-upgrade’ is included in init-premount and it invokes the SSD firmware updater (ssd-fw-upgrade) present in the partition indicated by the boot option 'ssd-upgrader-part'

How to verify it
In SONiC, the SSD firmware updater is copied to “/host/” directory.
Fast-reboot is to be initiated with the ‘-u’ option ([scripts/fast-reboot] Add option to include ssd-upgrader-part boot option with SONiC partition sonic-utilities#2150)
After reboot, while booting into SONiC the SSD firmware updater will be executed in initramfs.
2022-05-12 08:11:02 -07:00
Marty Y. Lok
23f9126f59
[VoQ][config] Multiasic Supervisor card fails to load config_db#.json in chassis when system is reboot (#10106)
Supervisor card fails to load config_db#.json in chassis when system reboot. 
This is an intermittent issue, fixes #10105
2022-05-09 11:06:11 -07:00
xumia
15cf9b0d70
Reduce image size for lazy installation packages (#10775)
Why I did it
The image size is too large, when there are multiple lazy packages and multiple platforms. It is not necessary to keep the lazy installation packages in multiple copies.
For cisco image, the image size will reduce from 3.5G to 1.7G.

How I did it
Use symbol links to only keep one package for each of the lazy package.
Make a new folder fsroot/platform/common
Copy the lazy packages into the folder.
When using a package in each of the platform, such as x86_64-grub, x86_64-8800_rp-r0, x86_64-8201_on-r0, etc, only make a symbol link to the package in the common folder.
2022-05-09 08:26:09 -07:00
xumia
8ec8900d31
Support SONiC OpenSSL FIPS 140-3 based on SymCrypt engine (#9573)
Why I did it
Support OpenSSL FIPS 140-3, see design doc: https://github.com/Azure/SONiC/blob/master/doc/fips/SONiC-OpenSSL-FIPS-140-3.md.

How I did it
Install the fips packages.
To build the fips packages, see https://github.com/Azure/sonic-fips
Azure pipelines: https://dev.azure.com/mssonic/build/_build?definitionId=412

How to verify it
Validate the SymCrypt engine:

admin@sonic:~$ dpkg-query -W | grep openssl
openssl 1.1.1k-1+deb11u1+fips
symcrypt-openssl        0.1

admin@sonic:~$ openssl engine -v | grep -i symcrypt
(symcrypt) SCOSSL (SymCrypt engine for OpenSSL)
admin@sonic:~$
2022-05-06 07:21:30 +08:00
Junchao-Mellanox
681c24878b
Fix race condition between networking service and interface-config service (#10573)
Why I did it
The PR is aimed to fix a bug that mgmt port eth0 may loss IP even if user configured static IP of eth0. This is not a always reproduceable issue, the reproducing flow is like:

Systemd starts networking service, which runs a dhcp based configuration and assigned an ip from dhcp.
Systemd starts interface-config service who depends on networking service
Interface-config service runs command “ifdown –force eth0”, check line. but networking service is still running so that this line failed with error: “error: Another instance of this program is already running.”. This error is printed by ifupdown2 lib who is the main process of networking service. So, ifdown actually does not work here, the ip of eth0 is not down.
Interface-config service updates /etc/networking/interface to static configuration.
Interface-config service runs command “systemctl restart networking”. This command kills the previous networking related processes (log: networking.service: Main process exited, code=killed, status=15/TERM), and try to reconfigure the ip address with static configuration. But it detects that the configured IP and the existing IP are the same, and it does not really configure the ip to kernel. Hence, the ip is still getting from dhcp. (this could be a bug of ifupdown2: previous ip is from dhcp, new ip is a static ip, it treats them as same instead of re-configuring the IP)
When the lease of the ip expires, the ip of eth0 is removed by kernel and the issue reproduces.
The issue is not always reproduceable because networking service usually runs fast so that it won't hit step#3.

How I did it
Check networking service state before running "ifdown –force eth0", wait for it done if it is activating.

How to verify it
Manual test.
2022-05-05 15:21:44 -07:00
shlomibitton
4ec3af86af
[Fastboot] Delay PMON service for better fastboot performance (#10567)
- Why I did it
Profiling the system state on init after fast-reboot during create_switch function execution, it is possible to see few python scripts running at the same time.
This parallel execution consume CPU time and the duration of create_switch is longer than it should be.
Following this finding, and the motivation to ensure these services will not interfere in the future, PMON is delayed in 90 seconds until the system finish the init flow after fastboot.

- How I did it
Add a timer for PMON service.
Exclude for MLNX platform the start trigger of PMON when SYNCD starts in case of fastboot.
Copy the timer file to the host bin image.

- How to verify it
Run fast-reboot on MLNX platform and observe faster create_switch execution time.
2022-05-02 10:44:17 +03:00
shlomibitton
1d84e0d7df
[Fastboot] Delay LLDP service for better fastboot performance (#10568)
- Why I did it
Profiling the system state on init after fast-reboot during create_switch function execution, it is possible to see few python scripts running at the same time.
This parallel execution consume CPU time and the duration of create_switch is longer than it should be.
Following this finding, and the motivation to ensure these services will not interfere in the future, LLDP is delayed in 90 seconds until the system finish the init flow after fastboot.

- How I did it
Add a timer for LLDP service.
Copy the timer file to the host bin image.

- How to verify it
Run fast-reboot on MLNX platform and observe faster create_switch execution time.
This PR is dependent on PR: #10567
2022-04-28 10:35:14 +03:00
ganglv
9d7387a18e
[sonic-host-services]: Fix import and invalid path (#10660)
Why I did it
Can not start sonic-hostservice

How I did it
Install python3-dbus and systemd-python, and replace invalid path

How to verify it
Start the service with below commands:
sudo systemctl start sonic-hostservice
sudo systemctl status sonic-hostservice

Signed-off-by: Gang Lv ganglv@microsoft.com
2022-04-27 07:14:51 +08:00
Saikrishna Arcot
64187a1b15
Remove SSH host keys after installing the custom version of sshd (#10633)
* Remove SSH host keys after installing the custom version of sshd

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Use an override for for sshd instead of overwriting the service file

Don't overwrite upstream's .service file, and instead use an override
file for making sure the host key(s) are generated.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-04-25 10:38:52 -07:00
bingwang-ms
3fc3259a35
Define qos map AZURE_TUNNEL for QoS remapping of tunnel traffic (#10565)
* Add AZURE_TUNNEL map

Signed-off-by: bingwang <wang.bing@microsoft.com>
2022-04-25 15:06:10 +08:00
yozhao101
e24fe9bc60
[Monit] Fix the issue which shows Monit can not reset its counter. (#10288)
Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Why I did it
This PR aims to fix the Monit issue which shows Monit can't reset its counter when monitoring memory usage of telemetry container.

Specifically the Monit configuration file related to monitoring memory usage of telemetry container is as following:

  check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
      if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
If memory usage of telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), then it will be restarted.
Recently we observed, after telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted anymore during this 1 hour sliding window.

The reason is Monit can't reset its counter to count again and Monit can reset its counter if and only if the status of monitored service was changed from Status failed to Status ok. However, during this 1 hour sliding window, the status of monitored service was not changed from Status failed to Status ok.

Currently for each service monitored by Monit, there will be an entry showing the monitoring status, monitoring mode etc. For example, the following output from command sudo monit status shows the status of monitored service to monitor memory usage of telemetry:

    Program 'container_memory_telemetry'
         status                             Status ok
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Sat, 19 Mar 2022 19:56:26
Every 1 minute, Monit will run the script to check the memory usage of telemetry and update the counter if memory usage is larger than 400MB. If Monit checked the counter and found memory usage of telemetry is larger than 400MB for 10 times
within 20 minutes, then telemetry container was restarted. Following is an example status of monitored service:

    Program 'container_memory_telemetry'
         status                             Status failed
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Tue, 01 Feb 2022 22:52:55
After telemetry container was restarted. we found memory usage of telemetry increased rapidly from around 100MB to more than 400MB during 1 minute and status of monitored service did not have a chance to be changed from Status failed to Status ok.

How I did it
In order to provide a workaround for this issue, Monit recently introduced another syntax format repeat every <n> cycles related to exec. This new syntax format will enable Monit repeat executing the background script if the error persists for a given number of cycles.

How to verify it
I verified this change on lab device str-s6000-acs-12. Another pytest PR (Azure/sonic-mgmt#5492) is submitted in sonic-mgmt repo for review.
2022-04-20 18:08:06 -07:00
Samuel Angebault
fb147764b5
[Arista] Fix arista-net initramfs hook (#10624)
The interface renaming logic fails if one interface is missing.
Because of the `set -e` the whole initramfs hook would abort early on
error.
This change fixes the current behavior to make sure missing interfaces
are properly skipped and ensure existing interface are renamed.
2022-04-20 10:03:05 -07:00
Junhua Zhai
128d762af3
[gearbox] Add peer gbsyncd for swss if gearbox exists (#10504)
Fix the issues #10501 and #9733

If having gearbox, we need:
    * add gbsyncd as a peer since swss also has dependency on gbsyncd
    * add service gbsyncd to FEATURE table if it is missing
2022-04-20 19:02:49 +08:00
kellyyeh
2a516a7763
[dhcp_relay] Enable dhcp_relay on EPMS, MgmtTsTor, MgmtToRRouter and BackEndToRRouter (#10474) 2022-04-15 18:01:24 -07:00
Yakiv Huryk
d9117d9411
[Mellanox][asan] add address sanitizer support for syncd (#10266)
Why I did it
To support address sanitizer for Mellanox syncd

How I did it
/var/log/asan is mapped for syncd container (the same as for swss)
container stop() has a timeout (60s) for syncd (the same as for swss)
This is so libasan has enough time to generate a report.
added ASAN's log path to Mellanox syncd supervisord.conf
added "asan: yes" to sonic_version.yml
How to verify it
Added artificial memory leaks
Compiled with ENABLE_ASAN=y
Installed the image on DUT
Rebooted the DUT
Verified that /var/log/asan/syncd-asan.log contains the leaks

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
2022-04-14 15:00:32 -07:00
Saikrishna Arcot
12ebe3ffa0
Run tune2fs during initramfs instead of image install (#10536)
If it is run during image install, it's not guaranteed that the
installation environment will have tune2fs available. Therefore, run it
during initramfs instead.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-04-12 16:24:13 -07:00
byu343
f7a6553933
[docker-syncd]: Add optional shm-size to syncd container (#10516)
Why I did it
In the bringup of tomahawk4/trident4, we realized that such chips need a larger size of /dev/shm in syncd container, so we added the option --shm-size to the docker create for syncd. The default value for shm-size is 64m; after this change, people can add SYNCD_SHM_SIZE=128m to platform_env.conf to change it to 128m.

How to verify it
We verified that after this change, 1) on existing platforms without platform_env.conf, the size of /dev/shm in syncd container (df -h | grep shm) is still the default 64M; 2) after we add SYNCD_SHM_SIZE=128m to platform_env.conf, /dev/shm in syncd becomes 128M.
2022-04-09 10:47:18 -07:00
Vivek R
ed14eb5263
[interfaces-config] "main exception: cannot find interfaces: eth0" error log avoided (#10463)
- Why I did it
Fixes #9628
During bootup, this error log is seen

Dec 22 04:26:29 sonic interfaces-config.sh[2546]: error: main exception: cannot find interfaces: eth0 (interface was probably never up ?)
This is of non-functional nature and doesn't affect the flow.

- How I did it
Dont take the ifdown if not needed

- How to verify it
Verified during reboot. Log did not appear and IP was acquired on eth0 as expected

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
2022-04-06 16:59:47 +03:00
bingwang-ms
b9dd1df372
Update qos config to clear queues for bounced back traffic (#10176)
* Update qos config to clear queues for bounced back traffic

Signed-off-by: bingwang <bingwang@microsoft.com>
2022-04-05 22:32:25 +08:00
judyjoseph
8e642848c2
Introduce the asic_subtype field for adding the sub platform variants. (#10235)
* Introduce the asic_subtype field for adding the sub platform variants. 
   It uses the value of TARGET_MACHINE variable in slave.mk.
2022-03-28 11:22:32 -07:00
Santhosh Kumar T
e2502edefd
Refactoring DELL platform init to reduce rc.local processing time porting changes in master (#10318)
Why I did it
To reduce the processing time of rc.local, refactoring s6100 platform initialization.
Porting changes from 202012 branch [202012] Refactoring DELL platform init to reduce rc.local processing time #10171
2022-03-24 11:14:37 -07:00
xumia
e9ac13678d
[Build]: Fix armhf mirrors not existing issue (#10312)
Why I did it
[Build]: Fix armhf mirrors not existing issue
The mirror endpoint debian-archive.trafficmanager.net does not support armhf, change to use deb.debian.org and security.debian.org.
2022-03-22 15:24:15 +08:00
Kostiantyn Yarovyi
bf4ab4a338
[Barefoot][Syncd] restart of the interface for cleaning txquee through which communication takes place between Sonic and openBMC (#9941)
Why I did it
improvement of starting barefoot SDK

How I did it
restart of the interface for cleaning txquee through which communication takes place between Sonic and openBMC

How to verify it
run sonic autorestart tests
2022-03-21 10:07:20 -07:00
Samuel Angebault
e4b507fa03
[Arista] rename management interface in initrd (#9856)
On some products the pci enumeration adds randomness into which nic gets
initialized first.
Because SONiC doesn't use deterministic interface naming but instead old
style interface naming, this leads to eth0 not always being the
management port.
To make sure eth0 is always the management port (SONiC expectation)
rename the interfaces in the initramfs for Arista products.
2022-03-21 17:55:23 +05:30
xumia
1017ee6002
[Build]: Use one debian mirror config (#10274)
Why I did it
Use one debian mirror config.
The empty config in https://github.com/Azure/sonic-buildimage/blob/master/files/image_config/apt/sources.list overrides the file https://github.com/Azure/sonic-buildimage/blob/master/files/apt/sources.list.amd64 (armhf/arm64), it does not make sense.
All the content in files/image_config/apt is no use, any one wants to add mirror config, please add in files/apt.

How I did it
Remove files/image_config/apt and the reference.
2022-03-21 16:47:20 +08:00
Saikrishna Arcot
5617b1ae3e
Image disk space reduction (#10172)
# Why I did it

Reduce the disk space taken up during bootup and runtime.

# How I did it

1. Remove python package cache from the base image and from the containers.
2. During bootup, if logs are to be stored in memory, then don't create the `var-log.ext4` file just to delete it later during bootup.
3. For the partition containing `/host`, don't reserve any blocks for just the root user. This just makes sure all disk space is available for all users, if needed during upgrades (for example).


* Remove pip2 and pip3 caches from some containers

Only containers which appeared to have a significant pip cache size are
included here.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Don't create var-log.ext4 if we're storing logs in memory

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Run tune2fs on the device containing /host to not reserve any blocks for just the root user

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-03-15 18:12:49 -07:00
Stepan Blyshchak
18d00dfbe7
[teamd.sh] kill teamd docker on warm shutdown for faster shutdown (#10219)
This can save 6 sec for teamd LAG restoration - the time between:

```
Mar  9 13:51:10.467757 r-panther-13 WARNING teamd#teamd_PortChannel1[28]: Got SIGUSR1.
Mar  9 13:52:33.310707 r-panther-13 INFO teamd#teamd_PortChannel1[27]: carrier changed to UP
```

- Why I did it
Optimize warm boot. Specifically reduce the time needed for LAG restoration.

- How I did it
Kill teamd docker after graceful shutdown of teamd processes.

- How to verify it
Run warm reboot.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-03-15 09:20:36 +02:00
xumia
0243ed9538
[build]: Fix marvell-armhf build hung issue (#10156) (#10229)
Why I did it
The marvel-armhf build is hung, it does not exit after waiting for a long time.
It is caused by the process /etc/entropy.py which is started by the postinst script in target/debs/buster/sonic-platform-nokia-7215_1.0_armhf.deb
2022-03-15 10:03:54 +08:00
Saikrishna Arcot
d7c3ce0045
Specify the filesystem type when mounting to /host (#10169)
When mounting the partition that contains `/host` during initramfs, the
mount binary available there (coming from busybox) tries each filesystem
in `/proc/filesystems` and sees which one succeeds. During this time,
there may be some error messages logged into dmesg because some of the
incorrect filesystems failed to mount the partition.

Specify the filesystem type explicitly so that initramfs knows it's that
type, and we know what filesystem will always get used there.

Fixes #9998

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-03-14 11:34:02 -07:00
Stepan Blyshchak
2919b4820f
[hostcfgd] record feature state in STATE DB (#9842)
- Why I did it
To implement blocking feature state change.

- How I did it
Record the actual feature state in STATE DB from hostcfg.

- How to verify it
UT + verification by running on the switch and checking STATE DB.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-03-14 13:45:27 +02:00
xumia
eea3cc7ad1
[Build]: only install grpc in amd64 (#10212)
[Build]: only install grpc in amd64
Unblock marvell-armhf build.
2022-03-14 13:41:37 +08:00
Samuel Angebault
8d419ca2c5
[Arista] Remove arista.log from rsyslog default logrotate (#9731)
Why I did it
In parallel of this change Arista added a custom logrotate configuration as part of its driver library.
Having 2 logrotate configuration for the same log file triggers an issue.

Fixes aristanetworks/sonic#38

How I did it
Arista merged a few changes in sonic-buildimage which added a logrotate configuration aristanetworks/sonic@e43c797
It is therefore the right path to remove the arista.log line from the logrotate.d/rsyslog configuration.

How to verify it
Logrotate works without any error message, arista log rotation happens and arista daemons still append logs once file was truncated.
2022-03-11 08:09:07 -08:00
xumia
9cdf81230b
[Build]: Fix /proc not mounted issue (#10164)
[Build]: Fix /proc not mounted issue
2022-03-11 09:23:37 +08:00
Song Yuan
01798447ab
[Chassis][QoS template] Skip configuring buffer and QoS config on recirc ports (#7869)
* Added test case to verify the template changes.
2022-03-09 16:04:36 -08:00
Kebo Liu
fe0a7693f4
[smartmontools] Install smartmontools with apt-get and upgrade it to 7.2-1 (#10087)
Why I did it
Smartmontools 6.6 has an issue with reading SMART info of nvme SSD
Smartmontools can be installed with apt-get, no need to build and install

How I did it
Use apt-get to install smartmontools 7.2-1
Remove previous make files for smartmontools 6.6

How to verify it
verify with "smartctl" can read out correct SMART info on NVME ssd.
verify "show platform ssdhealth" can still work

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-03-07 09:39:33 -08:00
Marty Y. Lok
c40f04f0e2
[chassis][supervisor]monit container-checker failed due to unexpected "database-chassis" docker running #9042 (#9043)
Why I did it
Fixed the monit container_checker fails due to unexpected "database-chassis" docker running on Supervisor card in the VOQ chassis. fixes #9042

How I did it
Added database-chassis to the always running docker list if platform is supervisor card.

How to verify it
Execute the CLI command "sudo monit status container_checker"


Signed-off-by: mlok <marty.lok@nokia.com>
2022-03-03 17:56:08 -08:00
Aravind Mani
1740beb1f2
[sonic-cfggen]: Fix sonic-cfggen build failures for armhf (#10132)
Why I did it
amrhf build fails while building sonic-config-engine whl package
https://dev.azure.com/mssonic/be1b070f-be15-4154-aade-b1d3bfb17054/_apis/build/builds/77089/logs/9

The reason for the failure is due to the fact that there is a new line generated at the top of the file in buffer config test cases while building for broadcom based platform and this issue is not seen in Marvell based platforms.

How I did it
Removed the new line for all the buffer test cases as there is no need to add it and accordingly changed the buffer_config.j2 where the new line is generated.
2022-03-02 13:06:20 -08:00
Lawrence Lee
a50d1f1fc8
[write_standby]: Increase timeout to 60s (#10065)
- Avoid scenarios where script times out before orchagent can establish IPinIP tunnel

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2022-02-24 14:55:45 -08:00
wenyiz2021
2d0b063191
Update container_checker for multi-asic devices when state is 'always_enabled' (#10067)
* Update container_checker for multi-asic devices 

Update container_checker for multi-asic devices to add database containers in always_running_containers. 
Previous change was made for single-asic, and that database containers were not considered as feature when writing to state_db.

* Update container_checker

Update an indent
2022-02-23 18:06:30 -08:00
vmittal-msft
bc1dfea619
Updated traffic scheduler settings for HWSKUs : DellEMC-Z9332f-O32 and DellEMC-Z9332f-M-O16C64 (#9828) 2022-02-23 17:22:41 -08:00
Stepan Blyshchak
fb752a4ae5
[rsyslog.j2] fix typo in VAR_LOG_SIZE_KB (#9954)
This issue causes negative threshold value and thus deleting log files even when there is enough space.

This issue causes negative threshold value and thus deleting log files even when there is enough space.

- Why I did it
To fix an issue when log files get deleted even if there is enough space.

- How I did it
Fixed an typo.

- How to verify it
Run the portion of the script that calculates threshold, see that the threshold is calculated correctly.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-02-17 10:16:44 +02:00
byu343
155220be9b
Support multi-asic on macsec container (#9921)
This change enables the support of running multiple macsec containers, each for one ASIC.
2022-02-13 22:45:24 -08:00
Oleksandr Ivantsiv
25a0ce5eb1
[asan] Add address sanitizer support. (#9857)
Implement infrastructure that allows enabling address sanitizer
for docker containers. Enable address sanitizer for SWSS container.

- Why I did it
To add a possibility to compile SONiC applications with address sanitizer (ASAN).
ASAN is a memory error detector for C/C++. It finds:
1. Use after free (dangling pointer dereference)
2. Heap buffer overflow
3. Stack buffer overflow
4. Global buffer overflow
5. Use after return
6. Use after the scope
7. Initialization order bugs
8. Memory leaks

- How I did it
By adding new ENABLE_ASAN configuration option.

- How to verify it
By default ASAN is disabled and the SONiC image is not affected.
When ASAN is enabled it inspects all allocation, deallocation, and memory usage that the application does in run time. To verify whether the application has memory errors tests that trigger memory usage of the application should be run. Ideally, the whole regression tests should be run. Memory leaks reports will be placed in /var/log/asan/ directory of SONiC host OS.

Signed-off-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com>
2022-02-09 13:29:18 +02:00
Prince George
ff14aebef9
Close console session due to user inactivity (#9890)
Signed-off-by: Prince George <prgeor@microsoft.com>
2022-02-02 09:41:21 +05:30
tbgowda
4e32f85a31
Enable SAI_SWITCH_ATTR_UNINIT_DATA_PLANE_ON_REMOVAL attribute (#9419)
Why I did it
Fixes #8980 partly.

The corresponding changes in sonic-sairedis is here :
Azure/sonic-sairedis#975

How I did it
Include changes from both repos and build an image for verification.

How to verify it
Trigger fast-reboot with the changes, see the attribute SAI_SWITCH_ATTR_UNINIT_DATA_PLANE_ON_REMOVAL being set at the SAI level.

Signed-off-by: Thushar Gowda <24815472+tbgowda@users.noreply.github.com>
2022-02-01 08:44:17 -08:00
Alexander Allen
8a07af95e5
[Mellanox] Modified Platform API to support all firmware updates in single boot (#9608)
Why I did it
Requirements from Microsoft for fwutil update all state that all firmwares which support this upgrade flow must support upgrade within a single boot cycle. This conflicted with a number of Mellanox upgrade flows which have been revised to safely meet this requirement.

How I did it
Added --no-power-cycle flags to SSD and ONIE firmware scripts
Modified Platform API to call firmware upgrade flows with this new flag during fwutil update all
Added a script to our reboot plugin to handle installing firmwares in the correct order with prior to reboot
How to verify it
Populate platform_components.json with firmware for CPLD / BIOS / ONIE / SSD
Execute fwutil update all fw --boot cold
CPLD will burn / ONIE and BIOS images will stage / SSD will schedule for reboot
Reboot the switch
SSD will install / CPLD will refresh / switch will power cycle into ONIE
ONIE installer will upgrade ONIE and BIOS / switch will reboot back into SONiC
In SONiC run fwutil show status to check that all firmware upgrades were successful
2022-01-24 00:56:38 -08:00
dflynn-Nokia
b6939b9927
[firsttime boot] suppress error message on platforms not supporting kdump (#9521)
Why I did it
Eliminate benign firsttime boot error reported when running on platforms that do not support kdump.

How I did it
Change rc.local to check for presence of the file /etc/default/kdump-tools before referencing it.

How to verify it
Install a new image on an armhf or arm64 platform and check for a failed reference to /etc/default/kdump-tools on firsttime boot.
2022-01-20 18:27:10 -08:00
Shyam
20f32dc072
Added gbsyncd infra for multi-ASIC, multi-PHY mode (#9722)
- External PHY is managed via gearbox (gbsybcd docker container) in SONiC
  - Enhanced 'External PHY management' from SONiC's single-ASIC environment to multi-ASIC
  - Enhanced gbsyncd docker container from single Namespace to multi-Namspace mode
  - Added gbsyncd.service.j2 on per_namespace basis.
  - Each namepace/ASIC now to have its unique gbsyncd<ASIC#> docker container with its
    own Gearbox table, redis-DB

Signed-off-by: Shyam Kumar <shyakuma@cisco.com>
2022-01-21 10:08:16 +08:00
Alexander Allen
5f596aef63
[pmon] Move smartctl from pmon to host (#9607)
Why I did it
Need to be able to run smartctl when pmon docker is not running.

How I did it
Removed the pmon dependency for pmon as well as the command wrapper and added it to the debian-extension.

How to verify it
Stop pmon
Run smartctl from the host and verify it runs without error
2022-01-19 10:53:10 -08:00
liuh-80
f166b991a7
[image]: Prevent radius passkey and snmp community string into syslog. (#9727)
[image]: Prevent radius passkey and snmp community string into syslog.  (#9727)

#### Why I did it
    Prevent radius passkey and snmp community string into syslog.

#### How I did it
    Add radius and snmp config command to PASSWD_CMDS

#### How to verify it
    Run and pass all UTs.

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106

#### Description for the changelog
    Add radius and snmp config command to PASSWD_CMDS to prevent radius passkey and snmp community string into syslog.

#### A picture of a cute animal (not mandatory but encouraged)
2022-01-17 16:26:22 +08:00
Sudharsan Dhamal Gopalarathnam
bd0a19aa17
[rsyslog]Setting log file size to 16Mb (#9504)
Why I did it
The existing log file size in sonic is 1 Mb. Over a period of time this leads to huge number of log files which becomes difficult for monitoring applications to handle.
Instead of large number of small files, the size of the log file is not set to 16 Mb which reduces the number of files over a period of time.

How I did it
Changed the size parameter and related macros in logrotate config for rsyslog

How to verify it
Execute logrotate manually and verify the limit when the file gets rotated.

Signed-off-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>
2022-01-14 10:24:07 -08:00
Marty Y. Lok
04a4b8dcb1
[multiasic][database]database.sh failed to create the database for namespace (#9502)
Why I did it
database.sh failed to create the database for namespace in multiasic platform.
The latest code Docker version 20.10.x, command "docker create" no longer takes optional "NET=" with empty value. Syntax error show with current docker create command in database.sh. Issue #9503

How I did it
Modify the docker_image_ctl.j2 to set default network setting NET="bridge" instead of empty for namespace database.
2021-12-13 10:17:05 -08:00
Qi Luo
cf4011d526
Revert "CRM init config for SRV6 Nexthop and MY_SID resource (#9238)" (#9506)
This reverts commit 8187d473af.
2021-12-12 12:16:39 -08:00
Samuel Angebault
d499455752
[Arista] Update driver submodules (#9393)
- Use SfpOptoeBase by default to leverage new `sonic_xcvr` refactor
 - Add support for `Woodleaf` product
 - Move `libsfp-eeprom.so` to a different `.deb` package
 - Add new logrotate configuration for arista logs
 - Improve logging mechanism for the drivers (IO loglevel, fix syslog duplicates)
 - Initialize chassis cards in parallel
 - Refactor of `get_change_event` to fix interrupts treated as presence change
2021-12-08 11:33:36 -08:00
Brian O'Connor
46bcda359c
[PINS] Build P4RT container for PINS (#9083)
- Add INCLUDE_PINS to config to enable/disable container
- Add Docker files and supporting resources
- Add sonic-pins submodule and associated make files

Submission containing materials of a third party:
    Copyright Google LLC; Licensed under Apache 2.0

#### Why I did it

Adds P4RT container to SONiC for PINS

The P4RT app is covered by this HLD:
https://github.com/pins/SONiC/blob/master/doc/pins/p4rt_app_hld.md

#### How I did it

Followed the pattern and templates used for other SONiC applications

#### How to verify it

Build SONiC with INCLUDE_P4RT set to "y".
Verify that the resulting build has a container called "p4rt" running.
You can verify that the service is up by running the following command on the SONiC switch:
```bash
sudo netstat -lpnt | grep p4rt
```
You should see the service listening on TCP port 9559.

#### Which release branch to backport (provide reason below if selected)

None

#### Description for the changelog

Build P4RT container for PINS
2021-12-07 11:11:25 -08:00
Marty Y. Lok
cb4c66ae98
[chassis][multiasic] fixed rsyslogd FATAL issue in the database container in multi-asic box (#8390)
Why I did it
Fix for issue #8389

How I did it
The /etc/rsyslog.conf is empty file which cause the FATAL of the process rsyslogd in the global instance database container. The function updateSyslogConf() should only generate the rsyslog.conf for containers in the namespace. it should not do it for the containers in the global instance. Instead, default rsyslog.conf should be used. Especially for database container, updateSyslogConf() is called before the database container is created. The result cause the sonic-cfggen failed to generate the rsyslog.conf.Why I did it
Fix for issue #8389

How I did it
The /etc/rsyslog.conf is empty file which cause the FATAL of the process rsyslogd in the global instance database container. The function updateSyslogConf() should only generate the rsyslog.conf for containers in the namespace. it should not do it for the containers in the global instance. Instead, default rsyslog.conf should be used. Especially for database container, updateSyslogConf() is called before the database container is created. The result cause the sonic-cfggen failed to generate the rsyslog.conf.

Signed-off-by: mlok <marty.lok@nokia.com>
2021-12-01 07:16:49 -08:00
liuh-80
739c45645c
[TACACS+] Add audisp-tacplus for per-command accounting. (#8750)
This pull request integrate audisp-tacplus to SONiC for per-command accounting.

#### Why I did it
To support TACACS per-command accounting, we integrate audisp-tacplus project to sonic.

#### How I did it
1. Add auditd service to SONiC
2. Port and patch audisp-tacplus to SONiC

#### How to verify it
UT with CUnit to cover all new code in usersecret-filter.c
Also pass all current UT.

#### Which release branch to backport (provide reason below if selected)
N/A

#### Description for the changelog
Add audisp-tacplus for per-command accounting.

#### A picture of a cute animal (not mandatory but encouraged)
2021-12-01 11:50:09 +08:00
noaOrMlnx
0908f9ec49
[CoPP] Add always_enabled field (#9302)
*Add the "always_enabled" field to copp_cfg.j2 file, in order to allow traps without an entry in features table, to be installed automatically.
2021-11-30 11:04:15 -08:00
Kumaresh Perumal
8187d473af
CRM init config for SRV6 Nexthop and MY_SID resource (#9238)
*Enable CRM for SRV6 Nexthop and SRV6 MY_SID entries.
2021-11-30 09:21:19 -08:00
Shi Su
4b357044b3
[bgpcfgd] Add bgpcfgd support to advertise routes (#9197)
Why I did it
Add bgpcfgd support to advertise routes.

How I did it
Make bgpcfgd subscribe to the ADVERTISE_NETWORK table in STATE_DB and configure route advertisement accordingly.

How to verify it
Added unit tests in bgpcfgd and verify on KVM about route advertisement.
2021-11-29 23:17:57 -08:00
Lawrence Lee
6e1a477ce0
[mux]: Fix mark_dhcp_packet (#9373)
- Consolidate the two [Service] sections by moving the ExecStartPre line for mark_dhcp_packet.py to the first section and removing the second.
- Make the mark_dhcp_packet.py file executable
- Also clean up mark_dhcp_packet.py
    - Remove unused imports
    - Fix spacing and line lengths to conform to PEP8
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-11-29 12:04:06 -08:00
Brian O'Connor
002827f08e
[PINS] Add APPL_STATE_DB and response path log (#9082)
- Add APPL_STATE_DB to database_config.json
- Clear APPL_STATE_DB during SwSS container restarts
- Add response path log file to logrotate config: responsepublisher.rec

Co-authored-by: PINS Working Group <sonic-pins-subgroup@googlegroups.com>
2021-11-24 10:31:06 -08:00
Stephen Sun
b3ccef9c08
[Reclaim buffer] Common infrastructure update for reclaiming buffer (#9133)
- Why I did it
This is to update the common sonic-buildimage infra for reclaiming buffer.

- How I did it
Render zero_profiles.j2 to zero_profiles.json for vendors that support reclaiming buffer
The zero profiles will be referenced in PR [Reclaim buffer] Reclaim unused buffers by applying zero buffer profiles #8768 on Mellanox platforms and there will be test cases to verify the behavior there.
Rendering is done here for passing azure pipeline.
Load zero_profiles.json when the dynamic buffer manager starts
Generate inactive port list to reclaim buffer

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-11-24 15:00:23 +02:00
Junhua Zhai
240596ec7d
[gearbox] provide common gbsyncd.service.j2 to start for platform specific gbsyncd docker (#9332)
Why I did it
Fix #9059. It provides common gbsyncd.service.j2 to start for platform specific gbsyncd docker, which must be named 'gbsyncd'.

How I did it
All of platform specific gbsyncd dockers use a common name 'gbsyncd'
Use a unique systemd service template gbsyncd.service.j2 for gbsyncd docker
2021-11-23 10:44:29 -08:00
Guohan Lu
f3faf6111b Revert "[gearbox] provide common gbsyncd.service.j2 to start for platform specific gbsyncd docker (#9286)"
This reverts commit 1d2a11bbb8.
2021-11-19 10:10:55 -08:00
Junhua Zhai
1d2a11bbb8
[gearbox] provide common gbsyncd.service.j2 to start for platform specific gbsyncd docker (#9286)
Why I did it
Fix #9059. It provides common gbsyncd.service.j2 to start for platform specific gbsyncd docker, which must be named 'gbsyncd'.

How I did it
All of platform specific gbsyncd dockers use a common name 'gbsyncd'
Use a unique systemd service template gbsyncd.service.j2 for gbsyncd docker
2021-11-17 23:49:49 -08:00
Vivek Reddy
ff32ac3ed4
[Auto Techsupport] Event driven Techsupport Changes (#8670)
#### Why I did it

Changes required for feature "Event Driven TechSupport Invocation & CoreDump Mgmt". [HLD](https://github.com/Azure/SONiC/pull/818 )

Requires: https://github.com/Azure/sonic-utilities/pull/1796.
Merging in any order would be fine.

Summary of the changes:

- Added the YANG Models for the new tables introduces as a part of this feature.
- Enhanced init_cfg.json with the default config required
- Added a compile Time flag which enables/disables the config required for this feature inside the init_cfg.json
- Enhanced the supervisor-proc-exit-listener script to populate `<feature>:<critical_proc> = <comm>:<pid>` info in the STATE_DB when it observes an proc exit notification for the critical processes running inside the docker.
2021-11-15 21:56:37 -08:00
Renuka Manavalan
a685fe1765
add arista.log to logrotate (#9245) 2021-11-15 07:29:30 -08:00
liuh-80
ff09b8b8ed
[TACACS+] Add Bash TACACS+ plugin for per-command authorization. (#8715)
This pull request add a bash plugin for TACACS+ per-command authorization

#### Why I did it
1. To support TACACS per command authorization, we check user command before execute it.
2. Fix libtacsupport.so can't parse tacplus_nss.conf correctly issue:
            Support debug=on setting.
            Support put server address and secret in same row.
3. Fix the parse_config_file method not reset server list before parse config file issue.

#### How I did it
The bash plugin will be called before every user command, and check user command with remote TACACS+ server for per-command authorization.

#### How to verify it
UT with CUnit cover all code in this plugin.
Also pass all current UT.

#### Which release branch to backport (provide reason below if selected)
N/A

#### Description for the changelog
Add Bash TACACS+ plugin.


#### A picture of a cute animal (not mandatory but encouraged)
2021-11-13 09:57:30 +08:00
Stepan Blyshchak
a2c2d67098
[ACL] enable ACL FC when genereting config from minigraph but disable by default (#8908)
* [ACL] enable ACL FC when genereting config from minigraph but disable by default
Why I did it
To support ACL counters on Flex Counter Infrastructure.

How I did it
Enable ACL FC in init_cfg and minigraph. Disable when genereting configuration from preset.

How to verify it
Together with depends PRs. Run ACL/Everflow test suite.

Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
2021-11-11 09:07:54 +08:00
Guohan Lu
5f11eb320e Revert "sysready (#8889)"
This reverts commit d7e5372e54.
2021-11-10 15:36:20 -08:00
Alexander Allen
2847265bfd Mellanox bullseye merge (#1)
Allow mellanox platform to build and successfully switch packets in
Debian 11

Upgraded

* Mellanox SDK
* Mellanox Hardware Management
* Mellanox Firmware
* Mellanox Kernel Patches

Adjusted build system to support host system running bullseye and
dockers running buster.
2021-11-10 15:27:22 -08:00
LuiSzee
5b284767f6 Update Centec platform support for Bullseye and 5.10 kernel (#7)
1. Fix build for armhf and arm64
2. upgrade centec tsingma bsp support to 5.10 kernel
3. modify centec platform driver for linux 5.10

Co-authored-by: Shi Lei <shil@centecnetworks.com>
2021-11-10 15:27:22 -08:00
Saikrishna Arcot
1d00613305 Add support for building Mellanox image
ISSU will likely be broken. As of right now, the issu-version file is
not being generated during build.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2021-11-10 15:27:22 -08:00
Saikrishna Arcot
33e4b7f90e Fix Python 3 syntax in SONiC container startup scripts
The common startup script used for SONiC containers is calling an inline
python command that uses Python 2 syntax, and thus errors out when run
with Python 3. Make this work with Python 3.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2021-11-10 15:27:22 -08:00
Saikrishna Arcot
fb03bd2440 Get packages for the base image from the main repos instead of our mirror
There appears to be some network issue in the pipeline builds when
downloading packages from our mirror. Change the source to be from the
main debian repos to try to get around this issue.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2021-11-10 15:27:22 -08:00
Saikrishna Arcot
2b0ad74db6 Update kdump-tools for bullseye
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2021-11-10 15:27:22 -08:00
Saikrishna Arcot
a1d30e3aa0 Python 2 removal/cleanup
Remove Python 2 package installation from the base image. For container
builds, reference Python 2 packages only if we're not building for
Bullseye.

For libyang, don't build Python 2 bindings at all, since they don't seem
to be used.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2021-11-10 15:27:22 -08:00
Saikrishna Arcot
b8a7a6355b Update the base Debian system installation script to get Bullseye
Python 2 is no longer available, so remove those packages, and remove
the pip2 commands. For picocom and systemd, just install from the
regular repo, since there's no backports yet.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2021-11-10 15:27:22 -08:00
Senthil Kumar Guruswamy
d7e5372e54
sysready (#8889) 2021-11-10 14:52:52 -08:00
Lawrence Lee
475bfc9625
[mux.service]: Remove pmon dependency (#9211)
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-11-10 08:08:03 -08:00
tjchadaga
8544147a70
Fix for additional intf flap during fast-reboot (#9166) 2021-11-08 15:21:11 -08:00
abdosi
ea91a72b79
[multi-asic] fix syslog not getting generated. (#9160)
Fixes #9159
2021-11-03 18:29:09 -07:00
trzhang-msft
689c101095
update DHCP_PACKET_MARK schema (#9077)
- update DHCP_PACKET_MARK schema in state_db
- this is an update over PR: Add service mark_dhcp_packet to mux container #9015
2021-11-02 15:55:50 -07:00
Stepan Blyshchak
2ef97bb5df
[dockers] change RPC, DBG dockers version: put RPG, DBG sign in build metadata part of the version (#8920)
- Why I did it
In case an app.ext requires a dependency syncd^1.0.0, the RPC version of syncd will not satisfy this constraint, since 1.0.0-rpc < 1.0.0. This is not correct to put 'rpc' as a prerelease identifier. Instead put 'rpc' as build metadata in the version: 1.0.0+rpc which satisfies the constraint ^1.0.0.

- How I did it
Changed the way how to version in RPC and DBG images are constructed.

- How to verify it
Install app.ext with syncd^1.0.0 dependency on a switch with RPC syncd docker.
Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
2021-11-01 19:02:57 +02:00
Cosmin-Jinga-MS
dfc1697045
[CBF] Added configuration templates to generate configs for CBF (#8689)
Updated CBF config packaging
[build_templates]: Added default configuration file for CBF
[rules]: Added loading rule for CBF config

 The CBF default config is required to load default start-up config on CBF capable platforms
2021-10-29 17:18:57 -07:00
Sachin Naik
99dcc831f2
[gearbox] Add gbsyncd container for Credo gearbox chips (#9009)
Enable gbsyncd support for cisco platforms

Signed-off-by: Sachin Naik sachnaik@cisco.com

Why I did it
To enable cisco gbsyncd container for cisco gearbox hardwares.

How I did it
Create symlink to gbsyncd.service.j2 to start gearbox systemd service.

How to verify it
Verify that the gbsyncd-cisco container started for x86_64-88_lc0_36fh_mo-r0 Line card

root@localhost:/home/cisco# docker ps
CONTAINER ID        IMAGE                                COMMAND                  CREATED             STATUS              PORTS               NAMES
50d309ea9967        docker-sonic-telemetry:latest        "/usr/local/bin/supe…"   26 minutes ago      Up 6 minutes                            telemetry
65cebc9e181b        docker-sonic-mgmt-framework:latest   "/usr/local/bin/supe…"   26 minutes ago      Up 6 minutes                            mgmt-framework
5a9b510da24d        docker-snmp:latest                   "/usr/local/bin/supe…"   26 minutes ago      Up 6 minutes                            snmp
c291b0a1fc87        26195cc7c042                         "/usr/bin/docker_ini…"   26 minutes ago      Up 6 minutes                            dhcp_relay
d85aa5e6b78c        docker-router-advertiser:latest      "/usr/bin/docker-ini…"   28 minutes ago      Up 6 minutes                            radv
46c787329374        docker-lldp:latest                   "/usr/bin/docker-lld…"   28 minutes ago      Up 6 minutes                            lldp
6643f53e4ceb        docker-gbsyncd-cisco:latest          "/usr/local/bin/supe…"   28 minutes ago      Up 6 minutes                            gbsyncd-cisco
f05ae8af4aaa        docker-syncd:latest                  "/usr/local/bin/supe…"   28 minutes ago      Up 6 minutes                            syncd
02e0e53b62cf        docker-teamd:latest                  "/usr/local/bin/supe…"   28 minutes ago      Up 6 minutes                            teamd
fc7bc2dbb6a9        docker-orchagent:latest              "/usr/bin/docker-ini…"   28 minutes ago      Up 6 minutes                            swss
5c5147c986c9        docker-fpm-frr:latest                "/usr/bin/docker_ini…"   28 minutes ago      Up 6 minutes                            bgp
63b5ce3d4c80        docker-platform-monitor:latest       "/usr/bin/docker_ini…"   28 minutes ago      Up 6 minutes                            pmon
7e6f34dca0e5        docker-database:latest               "/usr/local/bin/dock…"   28 minutes ago      Up 29 minutes                           database


Signed-off-by: Sachin Naik <sachnaik@cisco.com>

Co-authored-by: Sachin Naik <sachnaik@cisco.com>
2021-10-27 12:35:47 +08:00
Stepan Blyshchak
4ad5f2af3f
[swss.sh] fix an issue that dependent services are not read from a file (#8943)
This is due to the SERVICE variable declared after reading a file

#### Why I did it

To fix an issue that dhcp_relay does not restart with swss.

#### How I did it

Fixed in the swss.sh script

#### How to verify it

sudo systemctl restart swss
verify dhcp_relay restarts as well.
2021-10-26 19:01:30 -07:00
Maxime Lorrillere
81f4fca3dc
Allow database instances on multi-asic linecards to connect to chassis DB (#8583)
Add code to interfaces-config.sh to configure eth1 in multi-asic
containers so that they can access midplane subnet.

Co-authored-by: Maxime Lorrillere <mlorrillere@arista.com>
2021-10-26 18:27:09 -07:00
Marty Y. Lok
b91190d82d
[Nokia] Add protobuf and grpc C++ and python lib to support Nokia IXR7250E platform (#8366)
#### Why I did it
Nokia IXR7250E platform requires grpcio, grpcio-tools python library, and libprotobuf-dev, libgrpc++ library  

#### How I did it
Modified the build_debian.sh install libprotobuf-dev and libgrpc++ to support nokia ndk
Modified the sonic_debian_extension.j2 to install the grpcio and grpcio-tools in the host
Modified the docker-platform-monitor/Dockerfile.js to install grpcio and grpcio-tools for the pmon container.

#### How to verify it
Image running success.
2021-10-26 18:09:32 -07:00
trzhang-msft
4e0c4fb832
Add service mark_dhcp_packet to mux container (#9015)
- add a new service "mark_dhcp_packet" to mux container
- apply packet marks on a per-interface basis in ebtables
- write packet marks to "DHCP_PACKET_MARK" table in state_db
2021-10-26 14:10:13 -07:00
Nazarii Hnydyn
453346f8df
[teamd]: Send USR1/USR2 only to subscribers. (#8856)
To fix teamd signal handling, without which Process 'tlm_teamd' exited unexpectedly
2021-10-26 09:12:07 -07:00
Sumukha Tumkur Vani
3971c20001
Flush RESTAPI_DB when config reload is performed (#9037) 2021-10-22 11:45:19 -07:00
Lawrence Lee
d5834fcb1b Merged PR 4679112: [write_standby]: Ignore non-auto interfaces
[write_standby]: Ignore non-auto interfaces

* In the event that `write_standby.py` is used to automatically switchover interfaces when linkmgrd or bgp crashes, ignore any interfaces that are not configured to auto-switch

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-10-15 09:59:59 -07:00
Lawrence Lee
17cbfc44e6 Merged PR 4559560: [bgp]: Switch to standby if BGP container exits
[bgp]: Switch mux to standby if BGP container exits

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-10-15 09:59:59 -07:00
Lawrence Lee
69bae5b27a [write_standby]: Improve logging
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-10-15 09:59:59 -07:00
Lawrence Lee
fad5ec47b4 [mux]: Call write_standby from host only
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-10-15 09:59:59 -07:00
Lawrence Lee
5232647b33 [mux]: Make write_standby available on host
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>

[write_standby]: Cleanup and fix build

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-10-15 09:59:59 -07:00
Tamer Ahmed
b880f9d973 Merged PR 4813977: [mux] Update Service Install With SONiC Target
[mux] Update Service Install With SONiC Target

Recent PR grouped all SONiC service into sonic.taget. The install section
of mux.service was not update and this causes delays when using config
reload as the service failed state is not being reset.

signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
2021-10-15 09:59:59 -07:00
Lawrence Lee
0295c832c2 Merged PR 4366316: [mux.service]: Bind to sonic.target
[mux.service]: Bind to sonic.target

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-10-15 09:59:59 -07:00
Tamer Ahmed
bff785ec49 Merged PR 4234524: [mux] Start Mux on Only Dual-ToR Platform
[mux] Start Mux on Only Dual-ToR Platform

mux docker depends on the presence of mux cable hardware and is
supposed to run only Gemini ToRs. This PR change the mux feature
config in order to enable mux docker based on device configuration.

signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
2021-10-15 09:59:59 -07:00
Tamer Ahmed
c9c2826520 Merged PR 3845699: [linkmgrd]: Introduce MUX cable linkmgrd
Linkmgrd monitors link status, mux status, and link state. Has
the link becomes unhealthy, linkmgrd will trigger mux switchover
on a standby ToR ensuring uninterrupted service to servers/blades.
This PR is initial implementation of linkmgrd.

Also, docker-mux container hold packages related to maintaining and managing
mux cable. It currently runs linkmgrd binary that monitor and switches
the mux if needed.
This PR also introduces mux-container and starts linkmgrd as startup when
build is configured with INCLUDE_MUX=y

Edit: linkmgrd PR will follow.

signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>

Related work items: #2315, #3146150
2021-10-15 09:59:59 -07:00
Ying Xie
638c287837
[copp] bind copp-config.service to sonic.target (#8969)
copp-config service needs to be started after sonic.target so that it could
render the copp-config with the latest information.

It also needs to be restarted when config reload or load_minigraph is invoked.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2021-10-13 21:07:44 -07:00
liuh-80
7d40384c58
[TACACS+] Add plugin support to bash. (#8660)
This pull request add plugin support library to bash.
    And we will create a TACACS+ plugin for bash in an other PR, which will bring per command authorization feature to bash.

Why I did it
    To support TACACS per command authorization, we check user command before execute it.

How I did it
    Add plugin support to bash.

How to verify it
    UT with CUnit under bash project cover all new code in plugin.c.
    Also pass all current UT.

Which release branch to backport (provide reason below if selected)
    N/A

Description for the changelog
    Add plugin support to bash.
2021-10-11 15:20:51 +08:00
Ashok Daparthi-Dell
6cbdf11e53
SONIC QOS YANG - Remove qos tables field value refernce format (#7752)
Depends on Azure/sonic-utilities#1626
Depends on Azure/sonic-swss#1754

QOS tables in config db used ABNF format i.e "[TABLE_NAME|name] to refer fieldvalue to other qos tables.

Example:
Config DB:
"Ethernet92|3": {
"scheduler": "[SCHEDULER|scheduler.1]",
"wred_profile": "[WRED_PROFILE|AZURE_LOSSLESS]"
},
"Ethernet0|0": {
"profile": "[BUFFER_PROFILE|ingress_lossy_profile]"
},
"Ethernet0": {
"dscp_to_tc_map": "[DSCP_TO_TC_MAP|AZURE]",
"pfc_enable": "3,4",
"pfc_to_queue_map": "[MAP_PFC_PRIORITY_TO_QUEUE|AZURE]",
"tc_to_pg_map": "[TC_TO_PRIORITY_GROUP_MAP|AZURE]",
"tc_to_queue_map": "[TC_TO_QUEUE_MAP|AZURE]"
},

This format is not consistent with other DB schema followed in sonic.
And also this reference in DB is not required, This is taken care by YANG "leafref".

Removed this format from all platform files to consistent with other sonic db schema.
Example:
"Ethernet92|3": {
"scheduler": "scheduler.1",
"wred_profile": "AZURE_LOSSLESS"
},

Dependent pull requests:
#7752 - To modify platfrom files
#7281 - Yang model
Azure/sonic-utilities#1626 - DB migration
Azure/sonic-swss#1754 - swss change to remove ABNF format
2021-09-28 09:21:24 -07:00
Vaibhav Hemant Dixit
ee9250e8cc
Save DB dump after warm/fast reboot (#8803)
As a part of warmboot, redis database is dumped:
c97fe546e5/scripts/fast-reboot (L269)
However, this dump file is deleted, after it is loaded back into db post reboot.
The DB dump can be useful for debugging purpose, hence taking a backup of it can be useful.
Instead of deleting the dump, rename and keep the dump.
2021-09-23 23:53:22 -07:00
kellyyeh
62a1f5eb19
Add CLI Support for IPv6 Helpers and DHCPv6 Relay Counters (#8593) 2021-09-23 22:01:26 -07:00
abdosi
13ec43bc68
[baseimage]: Logrotate for wtmp and btmp files. (#8743)
Added logrotate file for wtmp and btmp to override default conf and set size cap as 100K as done in 
PR: #865. For buster this is control by separate file wtmp and btmp.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2021-09-15 23:28:27 -07:00
Sudharsan Dhamal Gopalarathnam
db529af203
Removing execute permission from copp config file (#8680)
*Removed execute permissions from the systemd copp-config.service file. 
Without this we will get a warning: "Configuration file /lib/systemd/system/copp-config.service is marked executable. Please remove executable permission bits. Proceeding anyway."
2021-09-13 09:10:21 -07:00
Ying Xie
41643a9729
[202012][fstrim] delay fstrim timer after sonic.target (#8737)
Why I did it
fstrim has dependency on pmon docker.

How I did it
start fstrim timer after sonic.target.

How to verify it
local test and PR test.

Signed-off-by: Ying Xie ying.xie@microsoft.com
2021-09-13 07:37:46 -07:00
byu343
50a9587e6e
[gbsyncd] Flush GB_ASIC_DB for gbsyncd cold restart (#8633)
This is to flush the state in GB_ASIC_DB when running 'config reload'. Otherwise, the left state affects the cold restart of gbsyncd.
2021-08-31 15:52:48 -07:00
Samuel Angebault
57e7b941ab
[Arista] Fix flash size computation for Lodoga (#8622)
The Lodoga platform also matched crow which was hardcoding the flash
size to 3700. This change enables autodetect on Clearlake which in turns
allows autodetect for Lodoga.

The threshold was bumped from 3700 to 4000 because size computation can
differ slightly and report slightly above 3700.
2021-08-30 15:26:56 -07:00
Samuel Angebault
48ba459f9f
[Arista] Rely on automatic flash size detection for Lodoga (#8608)
Lodoga actually has a 8GB storage device.
LodogaSsd variant has a 30GB SSD drive.
However, in boot0 both were mishandled and assigned 4GB for legacy reasons.

Remove the hardcoding of the flash size and let boot0 autodetect the available space.
2021-08-26 19:02:10 -07:00
dflynn-Nokia
7bae388e2f
[Nokia ixs7215] Add support for changing the console baud rate (#8595)
This commit adds support for changing the default console baud rate configured
within the U-Boot bootloader. That default baud rate is exposed via the value
of the U-Boot 'baudrate' environment variable. This commit removes logic that
hardcoded the console baud rate to 115200 and instead ensures that the U-Boot
'baudrate' variable is always used when constructing the Linux kernel boot
arguments used when booting Sonic.

A change is also made to rc.local to ensure that the specified baud rate is set
correctly in the serial getty service.
2021-08-26 07:14:34 -07:00
byu343
cdfb4855dc
[macsec] Add eapol to copp config (#8416)
This change enables the control packets of MACsec to be processed by CPU.
2021-08-23 18:56:23 -07:00
Volodymyr Samotiy
e3a30deea9
[monit] Periodically monitor VNET route consistency (#8266)
*To run VNET route consistency check periodically.
*For any failure, the monit will raise alert based on return code.
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-08-19 16:29:25 -07:00
abdosi
2348794ef0
Enable sysctl fib_multipath_use_neigh (#8502)
Enable fib_multipath_use_neigh for v4
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

Why I did:
This is helpful if the neighbor are not directly connected then Kernel forward to unreachable neighbor option. With this option forwarding using neighbor state to be valid.
2021-08-18 15:53:17 -07:00
Stephen Sun
c895677507
Use predefined macro as vendor information (#8361)
#### Why I did it
Use a predefined variable to get vendor information when the swss docker container is created

#### How I did it
Use `{{ sonic_asic_platform }}` instead of `$SONIC_CFGGEN -y /etc/sonic/sonic_version.yml -v asic_type`

#### How to verify it
Manually test.
2021-08-16 00:36:48 -07:00
Ying Xie
71e8b0caed
[aboot] use ram partition for /var/log for devices with 3.7G disks (#8400)
Master/202012 image size grew quite a bit. 3.7G harddrive can no longer hold one image and safely upgrade to another image. Every bit of harddrive space is precious to save now.

Also sh syntax seemingly changed, [ condition ] && action was a legit syntax in 201911 branch but it is an error when condition not met with 202012 or later images. Change the syntax to if statement to avoid the issue.

Signed-off-by: Ying Xie ying.xie@microsoft.com
2021-08-13 09:01:34 -07:00
Vladyslav Morokhovych
80e0627acc [swss] Fix arp_update script (#8412)
Fix #7968

Issue is detected on SONiC.20201231.11

In test_static_route.py::test_static_route_ecmp static routes are configured, but neighbors are not resolved after config reload even after 10 minutes.
It looks like the arp_update script is starting to ping when Vlan1000 is not fully configured.
When issue is reproduced, stuck ping6 process is observed in swss container :

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         180  0.1  0.0   6296  1272 pts/0    S    17:03   0:03 ping6 -I Vlan1000 -n -q -i 0 -c 1 -W 0 ff02::1
And when arp_update script successfully resolves neighbors, we observe sleep 300 instead of ping process
2021-08-12 23:29:22 -07:00
Saikrishna Arcot
c8b5daed27 Upgrade to ifupdown2 3.0.0 with a patch to fix using broadcast addresses
In version 3.0.0, If a broadcast address is specified in
/etc/network/interfaces, then when ifup is run, it will fail with an
error saying `'str' object has no attribute 'packed'`. This appears to
be because it expects all attributes for an interface to be "packable"
into a compact binary representation. However, it doesn't actually
convert the broadcast address into an IPNetwork object (other addresses
are handled).

Therefore, convert the broadcast address it reads in from a str to an
IPNetwork object.

Also explicitly specify the scope of the loopback address in
/etc/network/interfaces as host scope. Otherwise, it will get added as
global scope by default. As part of this, use JSON to parse ip's output
instead of text, for robustness.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2021-08-12 23:18:01 -07:00
Stepan Blyshchak
14da7a1663
[sonic_debian_extension.j2] export DOCKER_HOST so that clients can use it to connect to dockerd (#8398)
Use DOCKER_HOST. Every client including docker command and python docker API uses this environment variable to connect to dockerd.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-08-10 11:11:45 -07:00
lguohan
cf73e22d52
[build]: add branch and release name in sonic_version.yml (#6356)
the branch refers the branch name that the commit is in,
for example master, 202012, 201911, ...
In case there is no branch, the name will be HEAD.

release is encoded in /etc/sonic/sonic_release file.
the file is only available for a release branch.
It is not available in master branch.

example for master branch
```
build_version: 'master.602-6efc0a88'
debian_version: '10.7'
kernel_version: '4.19.0-9-2-amd64'
asic_type: vs
commit_id: '6efc0a88'
branch: 'master'
release: 'none'
build_date: Tue Dec 29 06:54:02 UTC 2020
build_number: 602
built_by: johnar@jenkins-worker-23
```

example for 202012 release branch
```
build_version: '202012.602-6efc0a88'
debian_version: '10.7'
kernel_version: '4.19.0-9-2-amd64'
asic_type: vs
commit_id: '6efc0a88'
branch: '202012'
release: '202012'
build_date: Tue Dec 29 06:54:02 UTC 2020
build_number: 602
built_by: johnar@jenkins-worker-23
```

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2021-08-08 20:44:02 -07:00
Guohan Lu
0b155c003e [build]: Fix docker pull on armhf platform
armhf build uses native dockerd

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2021-08-06 23:33:40 -07:00
Longxiang Lyu
6283716e1d
[swss][arp_update] Send ipv6 pings over vlan sub interfaces (#8363)
#### Why I did it
* `arp_update` fails to ping those neighbors over vlan sub interfaces.

#### How I did it
* modify `arp_update_vars.j2` to get vlan sub interfaces with ipv6 addresses assigned.
* modify `arp_update` to send ipv6 pings over those retrieved vlan sub interfaces.

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
2021-08-06 21:14:18 -07:00
byu343
2fccf0661a
[gearbox] Add gbsyncd container for Credo gearbox chips (#8144)
This change is to add a gbsyncd container to accommodate the syncd process and the SAI libraries for the Credo gearbox chips.

How I did it
This container works similar to the existing Broadcom syncd container. Its main difference is that the SAI-related dynamic libraries are replaced by the ones for Credo gearbox chips, and the container only reacts to SAI events for the gearbox chips. The SAI libraries will be provided by the package libsai-credo_1.0_amd64.deb.

For the image build, the added container will be built and included in the Broadcom platform image, after $(LIBSAI_CREDO)_URL = is replaced to the correct value. For now, as $(LIBSAI_CREDO)_URL is empty, the container build is skipped in the image build.

After the container is included in the image, in the runtime, the container will begin with checking the existence of /usr/share/sonic/hwsku/gearbox_config.json; if that file is not provided, the container will exit by itself. Therefore, for platforms unrelated to the Credo chips, as long as they are not providing the file, they will not be affected by this change.
2021-08-04 16:05:53 -07:00
VenkatCisco
d294807b6b
[baseimage]: add j2cli to sonic_debian_extension.j2 (#8019)
j2cli provides access to jinja library. cisco platform.py requires j2cli to handle jinja template configuration files.
2021-08-03 18:06:59 -07:00
vdahiya12
498422968f
[pmon] create and mount firmware directory on PMON for firmware upgrade support on muxcable (#8283)
This PR creates a directory firmware on the HOST with the path /usr/share/sonic/firmware, as well as this is 
mounted on PMON container with the same path /usr/share/sonic/firmware. This is required for firmware 
upgrade support for muxcable as currently by design all Y-Cable API's are called by xcvrd. As such if CLI has 
to transfer a file to PMON we need to mount a directory from host to PMON just for getting the firmware files. 
Hence we require this change.

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
2021-08-03 09:39:39 -07:00
mprabhu-nokia
3fd6e8d500
[systemd] ASIC status based service bringup on VOQ chassis (#7477)
Changes to allow starting per asic services like swss and syncd only if the platform vendor codedetects the asic is detected and notified. The systemd services ordering we want is database->database@->pmon->swss@->syncd@->teamd@->lldp@
There is also a requirement that management, telemetry, snmp dockers can start even if all asic services are not up.

Why I did it
For VOQ chassis, the fabric cards will have 1-N asics. Also, there could be multiple removable fabric cards. On the supervisor, swss and syncd containers need to be started only if the fabric-card is in Online state and respective asics are detected by the kernel. Using systemd, the dependent services can be in inactive state.

How I did it
Introduce a mechanism where all ASIC dependent service wait on its state to be published via PMON to REDIS. Once the subscription is received, the service proceeds to create respective dockers.
For fixed platforms, systemd is unchanged i.e. the service bring up and docker creation happens in the start()/ExecStartPre routine of the .sh scripts.
For VOQ chassis platform on supervisor, the service bringup skips docker creation in the start() routine, but does it in the wait()/ExecStart routine of the .sh scrips.
Management dockers are decoupled from ASIC docker creation.
2021-07-27 23:02:49 -07:00
賓少鈺
aa59bfeab7
[PDE]: introduce the SONiC Platform Development Env (#7510)
The PDE silicon test harness and platform test harness can be found in
src/sonic-platform-pdk-pde
2021-07-24 16:24:43 -07:00
Renuka Manavalan
3a96eb933e
Get Docker proxy info from config (#8205)
This helps not to hard code the docker proxy IP, but take it from config file during build time.
2021-07-19 21:17:47 -07:00
Renuka Manavalan
c5dff0c640
Revert "Revert "[Kubernetes]: The kube server could be used as http-proxy for docker (#7469)" (#8023)" (#8158)
This reverts commit 7236fa98e8.

Restore original PR #7469
2021-07-15 19:48:55 -07:00
Stepan Blyshchak
b3b6938fda
[dhcp-relay] make DHCP relay an extension (#6531)
- Why I did it
Make DHCP relay docker an extension. DHCP relay now carries dhcp relay commands CLI plugin and has a complete manifest.
It is installed as extension if INCLUDE_DHCP_REALY is set to y.

DEPENDS on #5939

- How I did it
Modify DHCP relay docker makefile and dockerfile. Make changes to sonic_debian_extension.j2 to install sonic packages.
I moved DHCP related CLI tests from sonic-utilities to DHCP relay docker.
This PR introduces a way to write a plugin as part of docker image and run the tests from cli-plugin-tests directory under docker directory.
The test result is available in target/docker-dhcp-relay.gz.log:

[ REASON ] :      target/docker-dhcp-relay.gz does not exist   NON-EXISTENT PREREQUISITES: docker-start target/docker-config-engine-buster.gz-load target/python-wheels/sonic_utilities-1.2-py3-none-any.whl-in
stall target/debs/buster/python3-swsscommon_1.0.0_amd64.deb-install
[ FLAGS  FILE    ] : []
[ FLAGS  DEPENDS ] : []
[ FLAGS  DIFF    ] : []
============================= test session starts ==============================
platform linux -- Python 3.7.3, pytest-3.10.1, py-1.7.0, pluggy-0.8.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /sonic/dockers/docker-dhcp-relay/cli-plugin-tests, inifile:
plugins: cov-2.6.0
collecting ... collected 10 items

test_config_dhcp_relay.py::TestConfigVlanDhcpRelay::test_plugin_registration PASSED [ 10%]
test_config_dhcp_relay.py::TestConfigVlanDhcpRelay::test_config_vlan_add_dhcp_relay_with_nonexist_vlanid PASSED [ 20%]
test_config_dhcp_relay.py::TestConfigVlanDhcpRelay::test_config_vlan_add_dhcp_relay_with_invalid_vlanid PASSED [ 30%]
test_config_dhcp_relay.py::TestConfigVlanDhcpRelay::test_config_vlan_add_dhcp_relay_with_invalid_ip PASSED [ 40%]
test_config_dhcp_relay.py::TestConfigVlanDhcpRelay::test_config_vlan_add_dhcp_relay_with_exist_ip PASSED [ 50%]
test_config_dhcp_relay.py::TestConfigVlanDhcpRelay::test_config_vlan_add_del_dhcp_relay_dest PASSED [ 60%]
test_config_dhcp_relay.py::TestConfigVlanDhcpRelay::test_config_vlan_remove_nonexist_dhcp_relay_dest PASSED [ 70%]
test_config_dhcp_relay.py::TestConfigVlanDhcpRelay::test_config_vlan_remove_dhcp_relay_dest_with_nonexist_vlanid PASSED [ 80%]
test_show_dhcp_relay.py::TestVlanDhcpRelay::test_plugin_registration PASSED [ 90%]
test_show_dhcp_relay.py::TestVlanDhcpRelay::test_dhcp_relay_column_output PASSED [100%]

=============================== warnings summary ===============================
/usr/local/lib/python3.7/dist-packages/tabulate.py:7
  /usr/local/lib/python3.7/dist-packages/tabulate.py:7: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
    from collections import namedtuple, Iterable

-- Docs: https://docs.pytest.org/en/latest/warnings.html
==================== 10 passed, 1 warnings in 0.35 seconds =====================
2021-07-15 10:35:56 -07:00
Blueve
3da6f12b0b
[port_config] Introduce ad-hoc mport_config.json file (#8066)
Signed-off-by: Jing Kan jika@microsoft.com
2021-07-15 08:56:35 +08:00
Stepan Blyshchak
3a2b8c6ba5
[SONiC Application Extension] support warm/fast reboot for extension packages (#7286)
#### Why I did it

I made this change to support warm/fast reboot for SONiC extension packages as per HLD Azure/SONiC#682.

#### How I did it

I extended manifest.json.j2 with new warm/fast reboot related fields and also extended sonic_debian_extension.j2 script template to generate the shutdown order files for warm and fast reboot.
2021-07-11 06:58:05 -07:00
Stepan Blyshchak
a294bfb03b
[sonic_debian_extension] fix packages.json generation and make the build fail when packages.json is not generated (#8044)
After https://github.com/Azure/sonic-buildimage/pull/7598 the packages.json generation is broken. This change fixes it make the whole build fail in case generation failed.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-07-09 12:29:33 -07:00
shlomibitton
776a446d76
[dhcp_relay] Disable dhcp_relay for ToRRouter switches type by the feature manager (#7789)
- Why I did it
Currently dhcp packets are disabled by the COPP manager for non ToRRouter type switches.
Even if the feature is enabled, DHCP packets wont hook to the CPU since the COPP manager will not trap this packets.
This change is to disable dhcp_relay by default for non ToRRouter switches from init_cfg.json.
With this approach, if the user want to enable the feature for non ToRRouter switches, manual enablement is required by the 'feature' configuration.
This is to keep the current approach for MSFT production issue with dhcp relay for non ToRRouter switched and allow the user to decide if to use it or not.

- How I did it
Configure dhcp_relay 'disabled' by default on init_cfg.json for non ToRRouter switches.
Remove the exclusion of dhcp packets on copp_cfg.json

- How to verify it
Enable dhcp_relay feature on a non ToRRouter switch.
Unit-tests modified so the default values on mocked CONFIG DB in 'test_vectors.py' for dhcp_relay will be 'disabled'.
This is by the change for 'init_cfg.json.j2'.
For ToRRouter the state will change from 'disabled' to 'enabled'.
Another test case added for a 'ToR' switch type, this is to test the state is 'enabled' if the user configured it to be so.
2021-07-08 09:10:46 +03:00
rajendra-dendukuri
f4b0c8fe4e
[kdump] Fix kdump error message when a reboot is issued (#7985)
dash doesn't support += operation to append to a variable's value. Use KDUMP_CMDLINE_APPEND="${KDUMP_CMDLINE_APPEND} " instead

The below error message is seen when a reboot is issued.

[ 342.439096] kdump-tools[13655]: /etc/init.d/kdump-tools: 117: /etc/default/kdump-tools: KDUMP_CMDLINE_APPEND+= panic=10 debug hpet=disable pcie_port=compat pci=nommconf sonic_platform=x86_64-accton_as7326_56x-r0: not found
2021-07-01 11:52:38 -07:00
Samuel Angebault
17f0217f30
[Arista] Chassis device configurations (#7529)
Add configurations for the following chassis elements

Fabrics 7804R3-FM, 7808R3-FM and 7808R3A-FM
Linecard 7800R3-48CQ2
Supervisor 7800-SUP*
2021-06-30 18:16:20 -07:00
vganesan-nokia
ffca17da0b
[voqsystemlagid] Fix for timing issue in setting system lag id boundary (#7911)
The voq system lag id boundary is set in redis-chassis. Changes include
setting this from database-chassis container. This fixes a timing issue
in finding datbase_config.json file from redis directory which is
created from database container. Since database container usually
starts after database-chassis container the existence of this file is
unreliable while running the command. Running the command under
database-chassis container makes sure that the database_config.json form
redis-chassis directory is guaranteed to be available and hence fixes the
timing issue.

Signed-off-by: vedganes <vedavinayagam.ganesan@nokia.com>
2021-06-30 09:57:08 -07:00
Ying Xie
7236fa98e8
Revert "[Kubernetes]: The kube server could be used as http-proxy for docker (#7469)" (#8023)
This change causes nightly test to fail due to the fake proxy IP is not reachable.

Reverts #7469

This reverts commit f7ed82f44a.
2021-06-29 18:43:53 -07:00
Stepan Blyshchak
9de7e6860b
[sonic-app-ext] support app extensions installation during build (#7593)
Signed-off-by: Stepan Blyschak stepanb@mellanox.com

Why I did it
To support building DHCP relay as extension and installing it during build time.

How I did it
Created infrastructure. Users need to define their packages in rules/sonic-packages.mk

How to verify it
Together with #6531
2021-06-29 09:07:33 -07:00
Stepan Blyshchak
9ce7c6d9fe
[hostcfgd] Configure service auto-restart in hostcfgd. (#5744)
Before this change, a process running inside every SONiC container dealt with FEATURE table 'auto_restart' field and depending on the value decided whether a container has to be killed or not.
If killed service auto restart mechanism restarts the container.
This change moves the logic from container to the host daemon - hostcfgd.
The 'auto_restart' handling is kept in supervisor-proc-exit-listener but now it is not required for container that wants to support auto restart feature.

hostcfgd refactoring - move feature handling in another class.
override systemd service Restart= setting from hostcfgd.
remove default systemd Restart=always.
Signed-off-by: Stepan Blyshchak stepanb@nvidia.com

- Why I did it

Remove the need to deal with container orchestration logic from the container itself. Leave this logic to the orchestrator - host OS.

- How I did it

hostcfgd configures 'Restart=' value for systemd service.

- How to verify it

root@r-tigon-11:/home/admin# sudo config feature autorestart lldp enabled
root@r-tigon-11:/home/admin# show feature status | grep lldp
lldp            enabled   enabled
root@r-tigon-11:/home/admin# docker exec -it lldp pkill -9 lldpd
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Exited (0) 20 seconds ago                       lldp
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Up 5 seconds                            lldp
root@r-tigon-11:/home/admin# sudo config feature autorestart lldp disabled
root@r-tigon-11:/home/admin# docker exec -it lldp pkill -9 lldpd
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Up 35 seconds                           lldp
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Exited (0) 3 seconds ago                       lldp
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Exited (0) 39 seconds ago                       lldp
root@r-tigon-11:/home/admin#
2021-06-29 09:06:21 -07:00
xumia
5c503b81ae
Fix vtysh shell-ingestion security issue (#7759)
Fix vtysh shell-ingestion security issue
Only expose the limited parameters of the command vtysh show.
2021-06-28 09:57:08 +08:00
Santhosh Kumar T
f8eb5b0958
Flashrom refactoring for broadcom platforms (#7693)
#### Why I did it
- To build flashrom properly with dependency tracking.

#### How I did it
- Moved flashrom code from platform/broadcom/sonic-platform-modules-dell/tools directory to src/flashrom directory.
- At the end, flashrom_0.9.7_amd64.deb package is build which will be installed in the devices.
- Currently flashrom builds only for Dell S6100 platforms.
2021-06-22 15:29:21 -07:00
judyjoseph
3ad830eb49
New sonic-buildimage images for Broadcom DNX ASIC family. (#7598)
Introduce new sonic-buildimage images for Broadcom DNX ASIC family.

sonic-broadcom-dnx.bin
sonic-aboot-broadcom-dnx.swi

How I did it

NO CHANGE to existing make commands

make init; make configure PLATFORM=broadcom;  make target/sonic-aboot-broadcom.swi; make  target/sonic-broadcom.bin

The difference now is that it will result in new broadcom images for DNX asic family as well. 

sonic-broadcom.bin, sonic-broadcom-dnx.bin
sonic-aboot-broadcom.swi, sonic-aboot-broadcom-dnx.swi

Note: This PR also adds support for Broadcom SAI 5.0 (based on 1.8 SAI ) for DNX based platform + changes in platform x86_64-arista_7280cr3_32p4 bcm config files and platform_env.conf files
2021-06-22 11:12:22 -07:00
Kebo Liu
078e0e0410
mount 'mellanox' folder only instead of create each sub folder (#7830)
#### Why I did it

Following the discussion in another PR https://github.com/Azure/sonic-buildimage/pull/7708#discussion_r642933510 , since there will be multi subfolders under **/var/log/mellanox**, so we agreed to only mount this folder and the subfolders will be created afterward on demand.  

#### How I did it

during the syncd docker creation, only mount  folder **/var/log/mellanox**

#### How to verify it

build an Mellanox image and verify the related folder on the host and docker side.
2021-06-22 06:27:56 -07:00
Sudharsan Dhamal Gopalarathnam
c88c3c7ba5
Grouping delayed services under a target for config reload checks (#7846)
#### Why I did it
Create a target for delayed service timers. Few services in sonic have delayed to speed up the bring up of the system and essential services. However there is no way to track when they start. This will be a problem when executing config reload as config reload expects all services to be up. Hence grouped all the timers that trigger the delayed services under one target so that they could be tracked in 'config reload' command

#### How I did it
Created delay.target service and add created dependency on the delayed targets.
2021-06-21 11:55:02 -07:00
Sujin Kang
ecc5073731
Support multiple pcie configuration file and change the pcie status table name to match with pcied changes (#7886)
Why I did it
Support multiple pcie configuration file and change the pcie status table name
This is to match with below two PRs.
Azure/sonic-platform-common#195
Azure/sonic-platform-daemons#189

How I did it
Check pcie configuration file with wild card and change the device status table name

How to verify it
Restart with changes and see if the pcie check works as expected.
2021-06-16 16:05:48 -07:00
Ann Pokora
3d629233bf
[MPLS][libnl3] libnl patches for supporting MPLS
* New accessors in libnl3 for MPLS attributes
* contains patch files for bug fixes in libnl3 for MPLS attribute parsing
2021-06-16 15:08:23 -07:00
Renuka Manavalan
f7ed82f44a
[Kubernetes]: The kube server could be used as http-proxy for docker (#7469)
Why I did it
The SONiC switches get their docker images from local repo, populated during install with container images pre-built into SONiC FW. With the introduction of kubernetes, new docker images available in remote repo could be deployed. This requires dockerd to be able to pull images from remote repo.

Depending on the Switch network domain & config, it may or may not be able to reach the remote repo. In the case where remote repo is unreachable, we could potentially make Kubernetes server to also act as http-proxy.

How I did it
When admin explicitly enables, the kubernetes-server could be configured as docker-proxy. But any update to docker-proxy has to be via service-conf file environment variable, implying a "service restart docker" is required. But restart of dockerd is vey expensive, as it would restarts all dockers, including database docker.

To avoid dockerd restart, pre-configure an http_proxy using an unused IP. When k8s server is enabled to act as http-proxy, an IP table entry would be created to direct all traffic to the configured-unused-proxy-ip to the kubernetes-master IP. This way any update to Kubernetes master config would be just manipulating IPTables, which will be transparent to all modules, until dockerd needs to download from remote repo.

How to verify it
Configure a switch such that image repo is unreachable
Pre-configure dockerd with http_proxy.conf using an unused IP (e.g. 172.16.1.1)
Update ctrmgrd.service to invoke ctrmgrd.py with "-p" option.
Configure a k8s server, and deploy an image for feature with set_owner="kube"
Check if switch could successfully download the image or not.
2021-06-16 07:46:01 -07:00
arlakshm
4d07bbbec6
[Yang][cfggen] update sonic-cfggen to generate config_db from Yang data (#7712)
Why I did it
This PR adds changes in sonic-config-engine to consume configuration data in SONiC Yang schema and generate config_db entries

How I did it
Add a new file sonic_yang_cfg_generator .
This file has the functions to

parse yang data json and convert them in config_db json format.
Validate the converted config_db entries to make sure all the dependencies and constraints are met.
Add a new option -Y to the sonic-cfggen command for this purpose

Add unit tests

This capability is support only in sonic-config-engine Python3 package only
2021-06-10 12:03:33 -07:00
yozhao101
1a3cab43ac
[Monit] Deprecate the feature of monitoring the critical processes by Monit (#7676)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
Currently we leveraged the Supervisor to monitor the running status of critical processes in each container and it is more reliable and flexible than doing the monitoring by Monit. So we removed the functionality of monitoring the critical processes by Monit.

How I did it
I removed the script process_checker and corresponding Monit configuration entries of critical processes.

How to verify it
I verified this on the device str-7260cx3-acs-1.
2021-06-04 10:16:53 -07:00
Renuka Manavalan
73447efc31
Add service to restore TACACS from old config (#7560)
Why I did it
In upgrade scenarios, where config_db.json is not carry forwarded to new image, it could be left w/o TACACS credentials.
Added a service to trigger 5 minutes after boot and restore TACACS, if /etc/sonic/old_config/tacacs.json is present.

How I did it
By adding a service, that would fire 5 mins after boot.
This service apply tacacs if available.

How to verify it
Upgrade and watch status of tacacs.timer & tacacs.service
You may create /etc/sonic/old_config/tacacs.json, with updated credentials
(before 5mins after boot) and see that appears in config & persisted too.

Which release branch to backport (provide reason below if selected)
 201911
 202006
 202012
2021-06-03 20:07:17 -07:00
Andriy Kokhan
6931a45ecf
Fixed typos in config-setup (#7754)
Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com>
2021-06-03 08:59:38 -07:00
yozhao101
37863ac854
[Monit] Restart telemetry container if memory usage is beyond the threshold (#7645)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
This PR aims to monitor the memory usage of streaming telemetry container and restart streaming telemetry container if memory usage is larger than the pre-defined threshold.

How I did it
I borrowed the system tool Monit to run a script memory_checker which will periodically check the memory usage of streaming telemetry container. If the memory usage of telemetry container is larger than the pre-defined threshold for 10 times during 20 cycles, then an alerting message will be written into syslog and at the same time Monit will run the script restart_service to restart the streaming telemetry container.

How to verify it
I verified this implementation on device str-7260cx3-acs-1.
2021-05-28 11:13:44 -07:00
Stepan Blyshchak
d7b96dfdf1
[sonic-sdk] add sonic sdk and sonic sdk buildenv (#6712)
- Why I did it

To give SONiC Application Extension developers an environment to run and develop their apps.

- How I did it
Created sonic-sdk and sonic-sdk-buildenv dockers and their dbg versions.

- How to verify it
Build:

$ make -f slave target/sonic-sdk.gz target/sonic-sdk-buildenv.gz
2021-05-28 10:16:02 -07:00
Renuka Manavalan
2cd61bc136
Invoke disk check periodically. (#7374)
Why I did it
Helps with periodic scan of disk for RO state.
If found, this script makes transient fix and raise error message.
2021-05-26 17:59:08 -07:00
Lawrence Lee
79914f5336
[swss.service]: Remove ordering with pmon (#7614)
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-05-26 09:12:54 -07:00
Prince Sunny
556a1dc9a8
[Mux] Do not clean-up HW_MUX_CABLE_TABLE from State DB (#7710)
Co-authored-by: Ubuntu <prsunny@prince-vm.vzw1i4tqyeburcdz5lrgulxi2c.yx.internal.cloudapp.net>
2021-05-26 09:12:34 -07:00
shlomibitton
9930e738fe
Remove 'vm.panic_on_oom=1' (#7678)
#### Why I did it
If a process limits using nodes by mempolicy/cpusets, and those nodes become memory exhaustion status, one process may be killed by oom-killer.
No panic occurs in this case, because other node's memory may be free.
This means system total status may be not fatal yet.

#### How I did it
Remove 'vm.panic_on_oom=1' kernel flag from 'vmcore-sysctl.conf '
2021-05-24 17:23:49 -07:00
Alexander Allen
da7533aad4
[ntp] Fix ntp.conf template to allow setting of source port in CONFIG_DB (#7586)
Why I did it
Currently, there is a bug in the ntp.conf jinja2 template where it will ignore the src_intf directive in CONFIG_DB if there are multiple IP addresses associated with an interface. This code change fixes that bug and allows the template to select the correct source interface for NTP.

How I did it
I did this by modifying the macro in ntp.conf.j2 which determines if there is an ip address associated with an interface to set a state variable when it detects a valid interface entry in CONFIG_DB instead of outputting "true" directly (which could result in multiple "trues" outputted for interfaces with multiple valid IP addresses).

How to verify it
Add two ipv4 addresses to an interface in SONiC

Add the following configuration to config_db.json

{
"NTP": {
    "global": {
        "src_intf": "Ethernet1"
        }
    }
}
Replace Ethernet1 with the interface name of the one you assigned the IP addresses to.

Run sudo config reload -y

Open /etc/ntp.conf and verify that the following line exists

...
interface listen Ethernet1
...
The interface specified should be the one set in the previous steps.

Description for the changelog
[ntp] Fix ntp.conf template to allow setting of source port in CONFIG_DB
2021-05-23 13:40:43 -07:00
Neetha John
3b06f44555
[qos]: modify dot1p to tc mapping (#7661)
Map priority 0 to TC 1 and priority 1 to TC 0

Send traffic on priority 0 and 1 and verified that it gets mapped correctly in hw

Signed-off-by: Neetha John <nejo@microsoft.com>
2021-05-20 10:36:39 -07:00
Sujin Kang
c6462577a9
add config-setup.service as dependency for pcie-check.service (#7599)
Why I did it
start pcie-check.service after config-setup.service since pcie_util depends on device_info which is available with config db metadata.

How I did it
Add config-setup.service as a dependency of pcie-check.service

How to verify it
Upon reboot, check if the pcie-check.sh throws the platform api error which is dependent on DEVICE_METADATA
2021-05-18 14:19:02 -07:00
Ze Gan
8f883fee67
[macsec]: Bind macsec service to sonic.target (#7642)
MACsec service cannot be enabled by "sudo config feature state macsec enabled"

Signed-off-by: Ze Gan <ganze718@gmail.com>
2021-05-18 11:44:21 -07:00
Renuka Manavalan
7a575b3d00
[container_checker] Use Feature table to get running containers (#7474)
Why I did it
Finding running containers through "docker ps" breaks when kubernetes deploys container, as the names are mangled.

How I did it
The data is is available from FEATURE table, which takes care of kubernetes deployment too.

How to verify it
Deploy a feature via kubernetes and don't expect error from container_check.
2021-05-07 08:42:15 -07:00
Nazarii Hnydyn
6e264d8ac9
[swss_vars]: Add 'resource_type' attribute. (#7526)
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2021-05-06 12:14:21 -07:00
Stepan Blyshchak
cd2c86eab6
[dockers] label SONiC Docker with manifest (#5939)
Signed-off-by: Stepan Blyschak stepanb@nvidia.com

This PR is part of SONiC Application Extension

Depends on #5938

- Why I did it
To provide an infrastructure change in order to support SONiC Application Extension feature.

- How I did it
Label every installable SONiC Docker with a minimal required manifest and auto-generate packages.json file based on
installed SONiC images.

- How to verify it
Build an image, execute the following command:

admin@sonic:~$ docker inspect docker-snmp:1.0.0 | jq '.[0].Config.Labels["com.azure.sonic.manifest"]' -r | jq
Cat /var/lib/sonic-package-manager/packages.json file to verify all dockers are listed there.
2021-04-26 13:51:50 -07:00
Guohan Lu
27a635a15a Revert "Flashrom refactoring (#6922)"
This reverts commit 7dd9d1f3f2.
2021-04-25 11:51:35 -07:00
xumia
56bdd750ab
Support readonly vtysh for sudoers (#7383)
Why I did it
Support readonly version of the command vtysh

How I did it
Check if the command starting with "show", and verify only contains single command in script.
2021-04-25 16:32:02 +08:00
a-barboza
ec9101f9c5
RADIUS Management User Authentication Feature (#7284)
Why I did it
HLD: https://github.com/Azure/SONiC/blob/master/doc/aaa/radius_authentication.md
CLI: In a separate PR.

How I did it
How to verify it
UT: src/sonic-host-services/tests/hostcfgd/hostcfgd_radius_test.py
2021-04-23 19:09:41 -07:00
Stepan Blyshchak
ae339c95d2
[systemd] disable default systemd udev rules for interfaces (#7369)
Fix #7364

99-default.link - was always in SONiC, but previous systemd (<247) had an issue and it did not work due to issue systemd/systemd#3374. Now systemd 247 works.

However, such policy overrides teamd provided mac address which causes teamd netdev to use a random mac
address. Therefore, needs to be disabled.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-04-21 17:50:00 -07:00
Santhosh Kumar T
7dd9d1f3f2
Flashrom refactoring (#6922)
#### Why I did it
To build flashrom properly with dependency tracking.

#### How I did it
Moved flashrom code from platform/broadcom/sonic-platform-modules-dell/tools directory to src/flashrom directory.
At the end, flashrom_0.9.7_amd64.deb package is build which will be installed in the devices.
2021-04-20 15:24:44 -07:00
Samuel Angebault
96690faa5b
[Arista] Fix dockerd issue on Arista platforms (#7376)
Why I did it
Recent systemd upgrade from #7228 requires an extra cmdline parameter for dockerd to start properly.
Updating boot0 was missed as part of the systemd upgrade change.

How I did it
Just added the missing cmdline parameter in files/Aboot/boot0.j2
This change fixes #7372

How to verify it
Boot the image and dockerd should start normally.
2021-04-20 14:55:14 -07:00
Kuanyu Chen
01f2b5f250
[config-setup]: Fix a bug in checking if updategraph is enabled (#7093)
Encounter error during "config-setup boot" if the updategraph is enabled.

How I did it
Correct the code inside the config-setup script.
Remove the space between the assignment operator.

How to verify it
Remove the /etc/sonic/config_db.json and reboot the device.
Originally, it will return following error after boot up.
rv: command not found
After modification, it can correctly parse the status of updategraph without error.
2021-04-19 11:40:52 -07:00
guxianghong
6fe6d7394d
[arm] support compile sonic arm image on arm server (#7285)
- Support compile sonic arm image on arm server. If arm image compiling is executed on arm server instead of using qemu mode on x86 server, compile time can be saved significantly.
- Add kernel argument systemd.unified_cgroup_hierarchy=0 for upgrade systemd to version 247, according to #7228
- rename multiarch docker to sonic-slave-${distro}-march-${arch}

Co-authored-by: Xianghong Gu <xgu@centecnetworks.com>
Co-authored-by: Shi Lei <shil@centecnetworks.com>
2021-04-18 08:17:57 -07:00
Stepan Blyshchak
4369361894
[sonic_debian_extension.j2] fix systemd version not from buster-backports (#7322)
Install systemd explicitelly from backports and install libsystemd* packages from backports.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-04-18 08:07:02 -07:00
jmmikkel
43342b33b8
[chassis] Add templates and code to support VoQ chassis iBGP peers (#5622)
This commit has following changes:

* Add templates and code to support VoQ chassis iBGP peers

* Add support to convert a new VoQChassisInternal element in the
   BGPSession element of the minigraph to a new BGP_VOQ_CHASSIS_NEIGHBOR 
   table in CONFIG_DB.
* Add a new set of "voq_chassis" templates to docker-fpm-frr
* Add a new BGP peer manager to bgpcfgd to add neighbors from the
  BGP_VOQ_CHASSIS_NEIGHBOR table using the voq_chassis templates.
* Add a test case for minigraph.py, making sure the VoQChassisInternal
  element creates a BGP_VOQ_CHASSIS_NEIGHBOR entry, but not if its
  value is "false".
* Add a set of test cases for the new voq_chassis templates in
  sonic-bgpcfgd tests.

Note that the templates expect the new
"bgp bestpath peer-type multipath-relax" bgpd configuration to be
available.

Signed-off-by: Joanne Mikkelson <jmmikkel@arista.com>
2021-04-16 11:11:32 -07:00
yozhao101
2737c9681f
[container_checker] Exclude the 'always_disabled' container from expected running container list (#7217)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
Since we introduced a new value always_disabled for the state field in FEATURE table, the expected running container list
should exclude the always_diabled containers. This bug was found by nightly test and posted at here: issue. This PR fixes #7210.

How I did it
I added a logic condition to decide whether the value of state field of a container was always_disabled or not.

How to verify it
I verified this on the device str-dx010-acs-1.

Which release branch to backport (provide reason below if selected)
 201811
 201911
 202006
[ x] 202012
2021-04-02 08:05:46 -07:00
vganesan-nokia
b313d4d092
[systemlag] Lag id boundary set for system lag (#6488)
Signed-off-by: vedganes <vedavinayagam.ganesan@nokia.com>

Changes for setting platfrom specific lag id boundary id in the chassis
app db. The platfrom specific lag id boundaries are supplied via
chassisdb.conf. The lag_id_start and lag_id_end boundary values sourced
from this file are set in chassis app db which will be used by lag id
allocator to allocate unique lag id in atomic fashion
2021-03-30 23:21:53 -07:00
Stepan Blyshchak
1f7d9e2698
[docker_img_ctl.j2] make tmpfs mounts optional and add ability to run container by image id (#6439)
- Why I did it
I made the docker_img_ctl.j2 applicable for more dockers (including application extensions dockers) by adding an option not to mount tmpfs on /tmp/ and /var/tmp/. In some applications /tmp/ is a different docker volume which can't be tmpfs.
Also, I added and ability to pass REPO[:TAG]|[@digest]/IMAGE_ID instead of just REPO name.

- How I did it
Modified docker_img_ctl.j2 and docker makefiles.

- How to verify it
Run it on the switch.
2021-03-16 17:03:12 +02:00
Stepan Blyshchak
2b8941e716
[sonic_debian_extension] add docker script to SONiC filesystem (#5935)
- Why I did it
To allow SONiC Package Migration during SONiC-2-SONiC upgrade we need to start docker daemon in chroot-ed environment in new SONiC filesystem.
Later this script will be used to start dockerd in chroot environment on SONiC

- How I did it
Install a docker service script into /usr/lib/docker/ in SONiC filesystem.

- How to verify it
Install SONiC image on the switch, mount squashfs to some directory, mount overlay rw layer over squashfs, mount procfs and sysfs, mount docker library. Start the docker using:
root@sonic:~$ /usr/lib/docker/docker.sh start

Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
2021-03-14 14:15:42 +02:00
Tamer Ahmed
51ab39fcb2
[hostcfgd]: Add Ability To Configure Feature During Run-time (#6700)
Features may be enabled/disabled for the same topology based on run-time
configuration. This PR adds the ability to enable/disable feature based
on config db data.

signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
2021-03-13 05:56:27 -08:00
Renuka Manavalan
6f7cd8d772
Copy dummy flannel.conf to get around absence of CNI Network (#6985)
Why I did it
We skip install of CNI plugin, as we don't need. But this leaves node in "not ready" state, upon joining master.
To fix, we copy this dummy .conf file in /etc/cni/net.d

How I did it
Keep this file in /usr/share/sonic/templates and copy to /etc/cni/net.d upon joining k8s master.

How to verify it
Upon configuring master-IP and enable join, watch node join and move to ready state.
You may verify using kubectl get nodes command
2021-03-09 19:49:54 -08:00
yozhao101
21f5e1280d
[Supervisord] Deduplicate the alerting messages of critical processes from Supervisord. (#6849)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
In the configuration of rsyslog, duplicate messages will be suppressed and reported in the format of message repeated n times.
Due to this behavior, if a critical process in a container exited unexpectedly, the alerting message will be written into syslog once
and not be written into syslog anymore until the second critical process exited. This PR aims to differentiate these alerting messages such that they will not be suppressed by rsyslogd and can appear in the syslog periodically.

How I did it
This PR adds a counter into the alerting message and shows how many minutes a critical process was not running.

How to verify it
I verified and test this implementation on a physical DUT.
2021-02-25 14:35:29 -08:00
Stepan Blyshchak
12c03c4f25
[sonic_debian_exntesion] install docker_image_ctl.j2 template in the image templates (#5937)
SONiC Package Manager will require to auto-generate the start script using that template. For that, we need this template to be recorded in SONiC filesystem.

Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
2021-02-25 09:11:12 -08:00
Stepan Blyshchak
e179ec2fae
[services] introduce sonic.target (#5705)
- Why I did it
Group all SONiC services together and able to manage them together. Will be used in config reload command as much simpler and generic way to restart services.

- How I did it
Add services to sonic.target

- How to verify it
Together with Azure/sonic-utilities#1199
config reload -y

Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
2021-02-25 14:26:24 +02:00
Ze Gan
4068944202
[MACsec]: Set MACsec feature to be auto-start (#6678)
1. Add supervisord as the entrypoint of docker-macsec
2. Add wpa_supplicant conf into docker-macsec
3. Set the macsecmgrd as the critical_process
4. Configure supervisor to monitor macsecmgrd
5. Set macsec in the features list
6. Add config variable `INCLUDE_MACSEC`
7. Add macsec.service

**- How to verify it**

Change the `/etc/sonic/config_db.json` as follow
```
{
    "PORT": {
        "Ethernet0": {
            ...
            "macsec": "test"
         }
    }
    ...
    "MACSEC_PROFILE": {
        "test": {
            "priority": 64,
            "cipher_suite": "GCM-AES-128",
            "primary_cak": "0123456789ABCDEF0123456789ABCDEF",
            "primary_ckn": "6162636465666768696A6B6C6D6E6F707172737475767778797A303132333435",
            "policy": "security"
        }
    }
}
```
To execute `sudo config reload -y`, We should find the following new items were inserted in app_db of redis
```
127.0.0.1:6379> keys *MAC*
1) "MACSEC_EGRESS_SC_TABLE:Ethernet0:72152375678227538"
2) "MACSEC_PORT_TABLE:Ethernet0"
127.0.0.1:6379> hgetall "MACSEC_EGRESS_SC_TABLE:Ethernet0:72152375678227538"
1) "ssci"
2) ""
3) "encoding_an"
4) "0"
127.0.0.1:6379> hgetall "MACSEC_PORT_TABLE:Ethernet0"
 1) "enable"
 2) "false"
 3) "cipher_suite"
 4) "GCM-AES-128"
 5) "enable_protect"
 6) "true"
 7) "enable_encrypt"
 8) "true"
 9) "enable_replay_protect"
10) "false"
11) "replay_window"
12) "0"
```

Signed-off-by: Ze Gan <ganze718@gmail.com>
2021-02-23 13:22:45 -08:00
arlakshm
f77157f09d
[baseimage] add ipintutil in sudoer file (#6845)
show ip interfaces is enhanced recently to support multi ASIC platforms in this PR- https://github.com/Azure/sonic-utilities/pull/1396 .
The ipintutil script as to run as sudo user, to get the ip interface from each namespace.
Add this script to the sudoer file so that show ip interface command is available for user with read-only permissions

Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2021-02-22 23:34:28 -08:00
Sujin Kang
d5238ae8dd
[pcie.yaml] Move pcie configuration file path to platform directory (#6475)
- Why I did it
The pcie configuration file location is under plugin directory not under platform directory.
#6437

- How I did it

Move all pcie.yaml configuration file from plugin to platform directory.
Remove unnecessary timer to start pcie-check.service
Move pcie-check.service to sonic-host-services
- How to verify it
Verify on the device
2021-02-21 08:27:37 -08:00
Samuel Angebault
5fb374b03d
[Arista] Driver and platform update (#6468)
- Add support for `DCS-7050SX3-48YC8` and `DCS-7050SX3-48C8` platform
 - Add support for more variants of `DCS-7280CR3-32[PD]4`
 - Add Supervisor to Linecard consutil support
 - Complete Watchdog platform API support
 - Fix some PSU behavior on `DCS-7050QX-32` and `DCS-7060CX-32S`
 - Fix SEU management on `DCS-7060CX-32S`
 - Allow kernel modules to build up to linux 5.10
 - Rename led color `orange` to `amber`
 - Miscellaneous fixes
2021-02-19 10:48:52 -08:00
SuvarnaMeenakshi
5a49a0f499
[multi-asic][vs]: Update topology script to retrieve hwsku from minigraph (#6219)
Update topology script to retrieve hwsku from minigraph
if hwsku information is not available in config_db.
Fix clean up of interfaces in msft_multi_asic_vs hwsku
topology script.
- Why I did it
When bringing up multi-asic VS switch, topology service is started during boot up.
Topology service starts a shell script which runs the topology script present in /usr/share/sonic/device// directory. To invoke hwsku specific script, the topology script tries to retrieve hwsku information from config_db.
During initial boot up config_db might not be populated. In order to start topology service before config_db is updated,
update topology script to get hwsku information from minigraph.xml if it is available.
This will be helpful to bring up multi-asic VS testbed by loading minigraph and starting topology service.
- How I did it
Update topology.sh script to retrieve hwsku information from minigraph.xml.
Fix clean up function on msft_multi_asic_vs toplogy script.
- How to verify it
single-asic VS - no change; topology service is only enabled for multi-asic VS.
multi-asic VS - Bring up multi-asic VS image, copy minigraph to vs image, start topology service. Topology service should be successful.
to test clean up function fix, start topology service - make sure interfaces are created and moved to the right namespaces.
stop topology service - make sure namespace do not have any interface and all front end interfaces are present in default namespace.
2021-02-18 22:02:29 -08:00
xumia
2ef5bd2e90
Add mirrors for reproducible build (#6813) 2021-02-18 14:59:52 +08:00
shlomibitton
f6bee7306e
Stop teamd service before syncd (#6755)
- What I did
All SWSS dependent services should stop before SWSS service to avoid future possible issues.
For example 'teamd' service will stop before to allow the driver unload netdev gracefully.
This is to stop all LAG's before restarting syncd service when running 'config reload' command.

- How I did it
Change the order of dependent services of SWSS.

- How to verify it
Run 'config reload' command.
Previously the operation failed when a large number of PortChannel configured on the system.

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2021-02-15 16:05:34 +02:00
Lawrence Lee
97c605f1f7
[swss]: Clear MUX-related state DB tables on start (#6759)
* Add *MUX_CABLE_TABLE* to set of tables to clear on SWSS start, which
will clear HW_MUX_CABLE_TABLE and MUX_CABLE_TABLE
* Order swss to start before pmon to ensure that DBs are cleared before
xcvrd (running inside pmon) starts and re-populates the tables

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-02-14 12:43:49 -08:00
dflynn-Nokia
88961f1339
[armhf build] Fix azure-storage dependency on cryptography package (#6780)
Fix marvell-armhf build break

The azure-storage package depends on the cryptography package. Newer
versions of cryptography require the rust compiler, the correct version
for which is not readily available in buster. Hence we pre-install an
older version here to satisfy the azure-storage dependency.
Note: This is not a problem for other architectures as pre-built versions
of cryptography are available for those. This sequence can be removed
after upgrading to debian bullseye.
2021-02-14 10:36:04 -08:00
Lior Avramov
6f8c31554f
[systemd] Increase syncd startup script timeout to support FW upgrade on init. (#6709)
**- Why I did it**
To support FW upgrade on init.

**- How I did it**
Change timeout value

**- How to verify it**
I manually changed ASIC and Gearbox FW followed by hard reset in order for FW upgrade to take place on init.

Signed-off-by: liora <liora@nvidia.com>
2021-02-11 12:53:36 +02:00
Arun Saravanan Balachandran
3015de1dd0
[sonic-host-service] Move to sonic-host-services package (#6273)
- Why I did it

To move ‘sonic-host-service’ which is currently built as a separate package to ‘sonic-host-services' package. 

- How I did it

- Moved 'sonic-host-server' to 'src/sonic-host-services' and included it as part of the python3 wheel.
- Other files were moved to 'src/sonic-host-services-data' and included as part of the deb package.
- Changed build option ‘INCLUDE_HOST_SERVICE’ to ‘ENABLE_HOST_SERVICE_ON_START’ for enabling sonic-hostservice at boot-up by default.
2021-02-08 19:35:08 -08:00
SuvarnaMeenakshi
62a599a5b3
[multi_asic][vs]: Add dependency in teamd service to start after topology service(#6594)
[multi_asic][vs]: Add dependency in teamd service to start after topology service.
- Why I did it
In multi-asic VS, topology service is run after database service to set up the internal asic topology.
swss and syncd have a dependency to start after topology service is run so that the interfaces are moved to right namespace and created in the right namespace. In case of multi-asic vs, during the initial boot up, when there is no configuration added, teamd service starts and swss/syncd do not start as topology service does not start. Upon loading configuration using config_db or minigraph, swss and sycnd start up , but teamd is not restarted as swss is not stopped and started. This causes teamd to be in a bad state and requires a reload of config.

- How I did it
Add dependency in teamd service to start after topology service is completed.

- How to verify it
No change in single asic vs or platform.
No change in multi-asic regular image.
Change only in multi-asic VS. Bring up a multi-asic VS image without any configration, teamd service will fail to start due to dependency failure. Load minigraph, start topology service, load configuration, ensure all services come up.
Signed-off-by: SuvarnaMeenakshi <sumeenak@microsoft.com>
2021-02-04 14:10:56 -08:00
Joe LeVeque
820d350301
[pcie-check] Update underlying pcieutil command and add to sudoers file (#6682)
- Why I did it

As of Azure/sonic-utilities#1297, subcommands of pcieutil have changed to remove the redundant pcie- prefix. This PR adapts calling applications (pcie-check) to the new syntax.

Resolves #6676

- How I did it

Remove pcie- prefix from pcieutil subcommands in calling applications
Also add pcieutil * to sudoers file, as pcieutil requires elevated permissions
2021-02-04 12:14:08 -08:00
Guohan Lu
3f2a39d583 [proc-exit-listener]: fix syntax error
the bug is introduced in commit 34cca20c

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2021-02-02 03:58:20 -08:00
Samuel Angebault
0c4d4ace76
[kdump] Fix OOM events in crashkernel (#6447)
A few issues where discovered with crashkernel on Arista platforms.

1) platforms using `docker_inram=on` would end up OOM in kdump environment.
This happens because the same initramfs is used by SONiC and the crashkernel.
With `docker_inram=on` the `dockerfs.tar.gz` is extracted in a `tmpfs` created for the occasion.
Since `dockerfs.tar.gz` weights more than 1.5G, it doesn't fit into the kdump environment and ends up OOM.
This OOM event can in turn trigger a panic.

2) Arista platforms with `secureboot` enabled would fail to load the crashkernel because the kernel parameter would be discarded on boot.
This happens because the `boot0` in secureboot mode is strict about kernel parameter injection.

3) The secureboot path allowlist would remove kernel crash reports.

4) The kdump service would fail on Arista products since `/boot/` is empty in `secureboot`

**- How I did it**

1) To prevent an OOM event in the crashkernel the fix is to avoid the codepaths in `union-mount` that create tmpfs and populate them. Some more codepath specific to Arista devices are also skipped to make the kdump process faster.
This relies on detecting that the initramfs is starting in a kdump environment and skipping some initialization.
The `/usr/sbin/kdump-config` tool appends a few kernel cmdline arguments when loading the crashkernel.
The most unique one is `systemd.unit=kdump-tools.service` which is used in a few initramfs hooks to set `in_kdump`.

2) To allow `kdump` to work in `secureboot` environment the cmdline generation in boot0 was slightly modified.
The codepath to load kernel parameters changed by SONiC is now running for booting in secure mode.
It was altered to prevent an append only behavior which would grow the `kernel-cmdline` at every reboot.
This ever growing behavior would lead `kexec` to fail to load the kernel due to a too long cmdline.

3) To get the kernel crash under /var/crash this path has to be added to `allowlist_paths`

4) The `/host/image-XXX/boot` folder is now populated in `secureboot` mode but not used.

**- How to verify it**

Regular boot:
 - enable kdump
 - enable docker_inram=on via kernel-params
 - reboot
 - generate a crash `echo c > /proc/sysrq-trigger`
 - before: witness OOM events on the console
 - after: crash kernel works and crash available under /var/crash

Secure boot:
 - enable kdump
 - reboot
 - generate a crash `echo c > /proc/sysrq-trigger`
 - before: witness no kdump
 - after: crash kernel works and crash available under /var/crash


Co-authored-by: Boyang Yu <byu@arista.com>
2021-02-02 01:55:09 -08:00
arlakshm
b5225407ef
[baseimage]: add docker ps to the sudoer file (#6604)
fixes Azure/sonic-utilities#1389

With the recent changes in sudoer files. The  show commands fails for the read-only users. 
The problem here is the 'docker ps' is failing in the function [get_routing_stack()](8a1109ed30/show/main.py (L54)) therefore all the CLI commands are failing.

Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2021-01-29 08:16:32 -08:00
arlakshm
ff8cc49b18
[multi asic] add ip netns identify command to sudoer (#6591)
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>

- Why I did it
The command sudo ip netns identify <pid> is used in function get_current_namespace
to check in the cli command is running in host context or within a namespace.

This function is used for every CLI command and command sudo ip netns identify <pid> needs to be added in sudoer files to allow users with RO access to run show cli commands

This problem is not there on single asic platforms.

- How I did it
Add ip netns identify [0-9]* to sudoers file.
2021-01-28 23:12:01 -08:00
Guohan Lu
34cca20cb6 [proc-exit-listener]: ignore blank lines
make proc-exit-listener more rebust

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2021-01-27 19:41:59 -08:00
abdosi
cfa8fbbf1a
[baseimage]: Updates for Ebtables and support for multi-asic (#6542)
Following changes were done for ebtables:

- Support for Multi-asic platforms. Ebtable filters are installed in namespace for multi-asic and not host. On Single asic installed on  host.

- For Multi-asic platforms we don't want to install on host otherwise Namespace-to-Namespace communication does not happens since ARP Request are not forwarded.

- Updated to use text file to restore ebtables rules then the binary format. Rules are restore as part of Database docker init instead of rc.local

- Removed the ebtable service files for buster as not needed as filters are restored/installed as part of database docker init.
   All the binaries are pre-installed with ebtables* binary are same as ebatbles-legacy-* 

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2021-01-27 08:36:10 -08:00
judyjoseph
46b3bd5503
[teamd]: Increase wait timeout for teamd docker stop to clean Port channels. (#6537)
The Portchannels were not getting cleaned up as the cleanup activity was taking more than 10 secs which is default docker timeout after which a SIGKILL will be send.
Fixes #6199
To check if it works out for this issue in 201911 ? #6503

This issue is significantly seen in master branch compared to 201911 because the Portchannel cleanup takes more time in master. Test on a DUT with 8 Port Channels.

master

    admin@str-s6000-acs-8:~$ time sudo systemctl stop teamd
    real    0m15.599s
    user    0m0.061s
    sys     0m0.038s
Sonic 201911.v58

    admin@str-s6000-acs-8:~$ time sudo systemctl stop teamd
    real    0m5.541s
    user    0m0.020s
    sys     0m0.028s
2021-01-23 20:57:52 -08:00
arlakshm
0e12ca81c7
[Multi Asic] support of swss.rec and sairedis.rec for multi asic (#6310)
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan arlakshm@microsoft.com

- Why I did it
This PR has the changes to support having different swss.rec and sairedis.rec for each asic.
The logrotate script is updated as well

- How I did it

Update the orchagent.sh script to use the logfile name options in these PRs(Azure/sonic-swss#1546 and Azure/sonic-sairedis#747)
In multi asic platforms the record files will be different for each asic, with the format swss.asic{x}.rec and sairedis.asic{x}.rec

Update the logrotate script for multiasic platform .
2021-01-22 09:42:19 -08:00
yozhao101
be3c036794
[supervisord] Monitoring the critical processes with supervisord. (#6242)
- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.

Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.

- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:

The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.

If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.

- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.

Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.

- Which release branch to backport (provide reason below if selected)

 201811
 201911
[x ] 202006
2021-01-21 12:57:49 -08:00
Qi Luo
25e4d773b9
[baseimage]: Cleanup sudoers file (#6518) 2021-01-21 08:28:32 -08:00
Ying Xie
054f5b7a53
[warm boot finalizer] only wait for enabled components to reconcile (#6454)
* [warm boot finalizer] only wait for enabled components to reconcile

Define the component with its associated service. Only wait for components that have associated service enabled to reconcile during warm reboot.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2021-01-15 07:48:11 -08:00
yozhao101
04cd1d61e8
[Monit] Monitoring the running status of containers. (#6251)
**- Why I did it**
This PR aims to monitor the running status of each container. Currently the auto-restart feature was enabled. If a critical process exited unexpected, the container will be restarted. If the container was restarted 3 times during 20 minutes, then it will not run anymore unless we cleared the flag using the command `sudo systemctl reset-failed <container_name>` manually. 

**- How I did it**
We will employ Monit to monitor a script. This script will generate the expected running container list and compare it with the current running containers. If there are containers which were expected to run but were not running, then an alerting message will be written into syslog.

**- How to verify it**
I tested this feature on a lab device `str-a7050-acs-3` which has single ASIC and `str2-n3164-acs-3` which has a Multi-ASIC. First I manually stopped a container by running the command `sudo systemctl stop <container_name>`, then I checked whether there was an alerting message in the syslog.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
2021-01-07 19:52:22 -08:00