Why I did it
Recently, the nightly testing pipeline found that the autorestart test case failed when run against the master image. The reason is that the Restart= field in each container's systemd configuration file was set to Restart=no even though the value of the auto_restart field in the FEATURE table of CONFIG_DB was enabled.
This issue, introduced by #10168, can be reproduced with the following steps:
Issue the config command to disable the auto-restart feature of a container.
Run the command config reload or config reload minigraph to enable auto-restart of the container.
Check the Restart= field in the container's systemd config file mentioned in step 1 by running the command
sudo systemctl cat <container_name>.service
Initially, this PR (#10168) intended to revert the changes proposed by #8861; however, it did not fully revert all of them.
How I did it
When hostcfgd starts or is restarted, the Restart= field in each container's systemd configuration file is now initialized according to the value of the auto_restart field in the FEATURE table of CONFIG_DB.
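A minimal sketch of the idea, assuming a swsscommon-style CONFIG_DB accessor and a systemd drop-in path (both assumptions; the actual hostcfgd code may differ):
```python
import os
import subprocess

def init_restart_policies(config_db):
    """On hostcfgd (re)start, align each feature's systemd Restart= value
    with the auto_restart flag in the FEATURE table of CONFIG_DB."""
    features = config_db.get_table('FEATURE')  # assumed ConfigDBConnector-style API
    for name, attrs in features.items():
        policy = 'always' if attrs.get('auto_restart') == 'enabled' else 'no'
        dropin_dir = '/etc/systemd/system/{}.service.d'.format(name)  # assumed path
        os.makedirs(dropin_dir, exist_ok=True)
        with open(os.path.join(dropin_dir, 'auto_restart.conf'), 'w') as f:
            f.write('[Service]\nRestart={}\n'.format(policy))
    subprocess.run(['systemctl', 'daemon-reload'], check=True)
```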
How to verify it
I verified this change by running the auto-restart test case against a newly built master image, and also ran the unit tests.
Signed-off-by: bingwang <bingwang@microsoft.com>
Why I did it
This PR is to add two extra lossless queues for bounced back traffic.
HLD sonic-net/SONiC#950
SKUs include
Arista-7050CX3-32S-C32
Arista-7050CX3-32S-D48C8
Arista-7260CX3-D108C8
Arista-7260CX3-C64
Arista-7260CX3-Q64
How I did it
Update the buffers.json.j2 and buffers_config.j2 templates to generate the new BUFFER_QUEUE table.
For T1 devices, queue 2 and queue 6 are set as lossless queues on T0 facing ports.
For T0 devices, queue 2 and queue 6 are set as lossless queues on T1 facing ports.
Queue 7 is added as a new lossy queue, since DSCP 48 is mapped to TC 7, which is in turn mapped to queue 7.
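For illustration, the generated CONFIG_DB content for a single T1-facing port might look like the sketch below; the port name and profile names are placeholders, not the exact template output:
```json
{
    "BUFFER_QUEUE": {
        "Ethernet0|2": { "profile": "egress_lossless_profile" },
        "Ethernet0|6": { "profile": "egress_lossless_profile" },
        "Ethernet0|7": { "profile": "egress_lossy_profile" }
    }
}
```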
How to verify it
Verified by UT
Verified by copying the new template and generating the buffer config with sonic-cfggen
#### Why I did it
Fix sonic-db-cli high CPU usage on SONiC startup issue: https://github.com/Azure/sonic-buildimage/issues/10218
The ETA for this issue is 2022/05/31.
#### How I did it
Rewrite sonic-db-cli in C++ in sonic-swss-common: https://github.com/Azure/sonic-swss-common/pull/607
Modify the swss-common rules and slave.mk to install the C++ version of sonic-db-cli.
#### How to verify it
Pass all E2E test scenarios.
#### Description for the changelog
Build and install the C++ version of sonic-db-cli from swss-common.
#### Why I did it
No logs currently exist for the sonic-slave-X containers, which makes debugging difficult.
#### How I did it
Altered Makefile.work to create logs in the sonic-slave-X folder while still displaying the log on screen, so as not to interfere with any existing tooling.
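A minimal sketch of the tee-based recipe body, with a hypothetical build_step.sh standing in for the real build command:
```sh
# Hypothetical sketch of the recipe body in Makefile.work: mirror the build
# output into the slave folder while keeping it on the console.
mkdir -p sonic-slave-buster
set -o pipefail   # keep the build's exit status from being masked by tee
./build_step.sh 2>&1 | tee sonic-slave-buster/build.log
```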
#### How to verify it
Do `make configure` and verify that logs show up in `sonic-slave-buster/` and `sonic-slave-bullseye/`
#### Description for the changelog
Add logging for slave container builds
* When reloading config after crashes, VTEP interfaces are sometimes not created since the tunnel still exists in the STATE_DB.
* Adding VXLAN_TUNNEL_TABLE to the list of tables to be cleaned in swss.sh fixes the problem.
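A hedged sketch of the kind of cleanup involved, assuming sonic-db-cli passes raw Redis commands through (the actual swss.sh logic may be structured differently):
```sh
# Remove stale VXLAN tunnel state so VTEP interfaces are recreated
# on config reload after a crash.
for key in $(sonic-db-cli STATE_DB KEYS "VXLAN_TUNNEL_TABLE|*"); do
    sonic-db-cli STATE_DB DEL "$key"
done
```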
Currently, the build with ASAN_ENABLE=y reuses the packages built with
ASAN_ENABLE=n (and vice versa). To address this issue, ASAN_ENABLE is added to DEP_FLAGS for asan-enabled packages (docker-syncd-mlnx, syncd, docker-orchagent, swss).
- Why I did it
To make the dpkg cache reuse or rebuild packages correctly for ASAN_ENABLE=y/n.
- How I did it
Added ASAN_ENABLE to the DEP_FLAGS for asan-enabled packages.
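A minimal sketch of the rule change, assuming the dpkg cache key is derived from each target's _DEP_FLAGS (target names illustrative):
```make
# Hypothetical: make the cache distinguish ASAN and non-ASAN artifacts.
$(SYNCD)_DEP_FLAGS += ASAN_ENABLE=$(ASAN_ENABLE)
$(SWSS)_DEP_FLAGS += ASAN_ENABLE=$(ASAN_ENABLE)
```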
- How to verify it
Built with ASAN_ENABLE=y/n and checked the .flags and .log files.
Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
This is to improve the readability of ASAN reports. The debug package adds function names and source code references to the backtrace (currently, the backtrace contains only the binary addresses of functions).
Another way to address this is to build the image with "INSTALL_DEBUG_TOOLS=y". The downside of that approach is that the image size and compilation time become unnecessarily large. Also, the idea is to make "ENABLE_ASAN" self-sufficient, which would not be the case with that approach.
- Why I did it
To improve the readability of ASAN logs.
- How I did it
Added SYNCD_DBG and SWSS_DBG to the corresponding docker images for the ASAN_ENABLE=y build
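A hedged sketch of the conditional, assuming the usual sonic-buildimage dependency-list conventions (target names illustrative):
```make
# Hypothetical: ship the debug packages in the images only for ASAN builds,
# so backtraces resolve to function names and source locations.
ifeq ($(ASAN_ENABLE),y)
$(DOCKER_SYNCD_MLNX)_DEPENDS += $(SYNCD_DBG)
$(DOCKER_ORCHAGENT)_DEPENDS += $(SWSS_DBG)
endif
```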
- How to verify it
Add artificial memory leak
Build with ASAN_ENABLE=y
Test the image and check the ASAN report
Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
* Removed unused default_config.json
* Remove asic.conf files from HW SKU directories as they are not used by upstream code
* Enable dynamic PCI ID identification on Otterlake2
Co-authored-by: Maxime Lorrillere <mlorrillere@arista.com>
Why I did it
Fix the "target directory not empty" issue when publishing artifacts.
Some of the artifacts are published to $(Build.ArtifactStagingDirectory)/target/ before the source code is checked out.
- Why I did it
Add a YANG model for the password hardening feature; the SONiC CLI for this feature was autogenerated from this YANG model.
- How I did it
Create a new YANG model in src/sonic-yang-models/yang-models/sonic-passwh.yang.
- How to verify it
There are unit tests (YANG tests) in this PR covering all the password policies with both good and bad values.
Alternatively, it is possible to verify manually using the config/show password commands that were autogenerated from this YANG model (this CLI code was added in sonic-utilities).
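For illustration, a password-policy container in such a model might look like the sketch below; the leaf names and types are hypothetical, not the actual sonic-passwh.yang content:
```yang
container PASSW_HARDENING {
    container POLICIES {
        leaf len_min {
            type uint8;   // hypothetical: minimum password length
        }
        leaf expiration {
            type uint16;  // hypothetical: days until a password expires
        }
    }
}
```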
- Why I did it
Update MFT to newer version
- How I did it
Update MFT_VERSION in platform/mellanox/mft.mk
- How to verify it
Check version via dpkg -l | grep mft
Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>
The official build fails, complaining about the missing targets below:
2022-05-25T10:50:38.0560306Z tar: target/debs/buster/libyang2-cpp1_2.0.112-6_amd64.deb: Cannot stat: No such file or directory
2022-05-25T10:50:38.0571392Z tar: target/debs/buster/libyang2-cpp-dev_2.0.112-6_amd64.deb: Cannot stat: No such file or directory
2022-05-25T10:50:38.0588893Z tar: target/debs/buster/libyang2-cpp1-dbgsym_2.0.112-6_amd64.deb: Cannot stat: No such file or directory
2022-05-25T10:50:38.0590887Z tar: target/debs/buster/yang-tools_2.0.112-6_all.deb: Cannot stat: No such file or directory
Signed-off-by: Neetha John <nejo@microsoft.com>
Why I did it
There was a typo in the hwsku specified as part of #10889
How I did it
Replaced it with the correct hwsku
How to verify it
test_cfggen.py is passing
Updating the sonic-swss-common submodule with the following commits
84dbd93 Clear the fvs vector before popping the message out of notification
a90b2b7 selectabletimer: add mutex to start() and stop()
7ae22be Fix SIGTERM can't terminate PubSub::listen issue
#### Why I did it
To fix the issue where hostcfgd can't be terminated by SIGINT, the sonic-swss-common submodule needs to be updated.
#### Description for the changelog
84dbd93 Clear the fvs vector before popping the message out of notification
a90b2b7 selectabletimer: add mutex to start() and stop()
7ae22be Fix SIGTERM can't terminate PubSub::listen issue
#### Why I did it
For the YANG model, the sample_config_db.json file was missing sample data for the SNAT/DNAT/IPMC features
#### How I did it
Added the SNAT, DNAT, and IPMC (low threshold/high threshold/threshold_type) entries to the CRM table.
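For reference, the added fields follow the existing CRM *_threshold_type/*_low_threshold/*_high_threshold pattern; a sketch with placeholder values:
```json
{
    "CRM": {
        "Config": {
            "snat_entry_threshold_type": "percentage",
            "snat_entry_low_threshold": "70",
            "snat_entry_high_threshold": "85",
            "dnat_entry_threshold_type": "percentage",
            "dnat_entry_low_threshold": "70",
            "dnat_entry_high_threshold": "85",
            "ipmc_entry_threshold_type": "percentage",
            "ipmc_entry_low_threshold": "70",
            "ipmc_entry_high_threshold": "85"
        }
    }
}
```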
#### How to verify it
With sanity Build/test only.
[muxorch] Handling optional attributes in muxorch (#2288)
Update netlink messages handler (#2233)
Broadcast Unknown-multicast and Unknown-unicast Storm-control (#1306)
[vstest]: Increase PollingConfig default timeout (#2285)
[FDB] Fix fbdorch to properly handle syncd FDB FLUSH Notif (#2254)
[macsecorch]: Support for non-default sa per sc (#2250)
Migrating the NAT vs tests from Click to direct DB access (#2278)
[neighsync] Ignoring IPv4 link local addresses (#2260)
[IntfMgrd] Retry adding ipv6 prefix by setting disabled_ipv6 flag (#2267)
Increase Redis Timeout value for Switch Create Opration for Packet (#2243)
Update fdborch.cpp (#2261)
Signed-off-by: dprital <drorp@nvidia.com>
- Why I did it
With SAI v1.10.2, new interface types CR8/SR8/KR8/LR8 have been introduced; we should also support them in the CLI configuration.
- How I did it
Add new enum values for the new interface types
- How to verify it
Run the "config interface type" command to verify new interface types can be accepted and handled correctly.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Why I did it
This is to improve the build performance when building multiple targets.
The modified time of a downloaded file should not be older than that of the file .platform.
Otherwise, the file will be downloaded again when building any dependent target.
How I did it
When packages are downloaded from a web site, their modified time is now updated with the command "touch".
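A minimal sketch of the idea (URL and path are placeholders):
```sh
# Download, then refresh the mtime so the file is newer than .platform
# and dependent targets do not re-trigger the download.
wget -O target/files/some-package.deb "$PACKAGE_URL"
touch target/files/some-package.deb
```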
This reverts commit 15cf9b0d70.
Why I did it
Revert PR #10775, as it has an impact on the ONIE installation.
The problem is caused by symbolic links not being supported by some ONIE unzip implementations.
We will re-enable it after fixing the issue; see #10914.
#### Why I did it
To ensure that some internal testcases do not break due to external changes
#### How to verify it
Ran test_cfggen.py with the changes and it passed
- Why I did it
YANG schema is missing for sonic-telemetry
- How I did it
Added YANG schema to sonic-yang-models and appropriate unit tests inside of test and test_config
- How to verify it
Build sonic-yang-models python wheels target and verify that unit tests are passing
What/Why I did:
Issue 1: By setting up the ipvlan interface in interface-config.sh, we are not tolerant to failures, because interface-config.service is one-shot and does not have restart capability.
Scenario: If, for example, the database service goes into a failed state, the interface-config service also fails because of the dependency check; later the database service gets restarted, but the interface service remains stuck and the ipvlan interface never gets created.
Solution: Moved all the logic from the interface-config service into the database service, which is also more logically aligned, since the namespace is created there and all the network settings (sysctl) happen there. With this, whenever the database service starts, we recreate the interface.
Issue 2: Use of IPVLAN vs MACVLAN
Currently we use ipvlan mode. However, the failure scenario above is not handled correctly in ipvlan mode. Once the ipvlan interface is created and an IP address is assigned to it, restarting the interface-config or database (new PR) service makes the Linux kernel report "Error: Address already assigned to an ipvlan device." (see https://github.com/torvalds/linux/blob/master/drivers/net/ipvlan/ipvlan_main.c#L978). The reason is that if the IP address assignment (which must be unique for ipvlan) is not cleaned up, it remains in the kernel database and never returns to the free pool, even though the namespace is deleted.
Solution: Given this hard dependency on a unique IP, macvlan mode is better for us, since everything is managed by the Linux kernel and there is no dependency on the user-configured IP address.
Issue 3: The namespace database service does not check reachability to the supervisor's Redis Chassis Server.
Currently there is no explicit check, as we never issue a Redis PING from the namespace to the supervisor's Redis Chassis Server. Without such a check, it is possible to start the database and all the other dockers even though there is no connectivity, and to hit the error/failure late in the cycle.
Solution: Added an explicit PING from the namespace to check this reachability.
Issue 4: flushdb raises an exception when trying to access the Chassis Server DB over a Unix socket.
Solution: Handle it gracefully via try..except and log the message.
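A hedged sketch covering both solutions, using the redis Python client with placeholder connection details (the actual scripts may use sonic-db-cli or swsscommon instead):
```python
import logging
import time
import redis

def wait_for_chassis_db(host, port=6380, retries=30):
    """Issue 3: block until the supervisor's Redis Chassis Server answers PING."""
    client = redis.Redis(host=host, port=port)
    for _ in range(retries):
        try:
            if client.ping():
                return True
        except redis.ConnectionError:
            time.sleep(1)
    return False

def safe_flushdb(unix_socket_path):
    """Issue 4: flush the Chassis Server DB, logging failures instead of crashing."""
    try:
        redis.Redis(unix_socket_path=unix_socket_path).flushdb()
    except redis.RedisError as exc:
        logging.warning('flushdb over unix socket failed: %s', exc)
```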
Why I did it
Upgrade FRR to version 8.2.2. Build libyang2 required by FRR.
How I did it
Update FRR version and tag.
How to verify it
The following tests were performed on sonic-vs:
BGP docker status check
BGP configuration and session establishment
Route redistribution and ping
Issued show commands to check the bgp neighbor and routes
Checked app-db to ensure bgp routes are installed with correct interface and nexthop.
Create VRF and check FRR knows the VRF
Check VRF routes are installed in app-db with correct Vrf name and next-hop
Establish BGP Evpn session and check if Evpn routes (multicast, mac, prefix) are exchanged and installed correctly in app-db.
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
Why I did it
Resolves #10761.
For VOQ chassis, the recirc port, which was added for Everflow, stays admin down after loading the minigraph.
This PR adds a fix to bring the recirc port admin up.
How I did it
The PR adds a change in minigraph.py: if a port has the role Rec, its admin status is set to up.
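A minimal sketch of the logic; the dictionary keys follow the usual minigraph port-attribute shape but are assumptions here:
```python
def set_recirc_ports_admin_up(ports):
    """Bring recirculation ports up when generating config from minigraph."""
    for name, attrs in ports.items():
        if attrs.get('role') == 'Rec':
            attrs['admin_status'] = 'up'
    return ports
```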
How to verify it
UT
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
Why I did it
There is a bug where the port attributes in CONFIG_DB are cleared when running sudo config macsec port add Ethernet0 or sudo config macsec port del Ethernet0
How I did it
Fetch the existing port attributes before setting/removing the MACsec field in the port table.
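A hedged sketch of the fix, assuming a ConfigDBConnector-style get_entry/set_entry API like the one used in sonic-utilities (function names are illustrative):
```python
def set_macsec_profile(config_db, port, profile):
    """Merge the macsec field into the existing PORT entry instead of
    writing a fresh entry, which previously wiped the other attributes."""
    attrs = config_db.get_entry('PORT', port) or {}
    attrs['macsec'] = profile
    config_db.set_entry('PORT', port, attrs)

def del_macsec_profile(config_db, port):
    """Remove only the macsec field, preserving everything else."""
    attrs = config_db.get_entry('PORT', port) or {}
    attrs.pop('macsec', None)
    config_db.set_entry('PORT', port, attrs)
```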
Signed-off-by: Ze Gan <ganze718@gmail.com>
Why I did it
To add the cable_type and soc_ipv4 fields to the MUX_CABLE table, this PR parses minigraph sections like the following:
```
<Device i:type="SmartCable">
<ElementType>SmartCable</ElementType>
<SubType>active-active</SubType>
<Address xmlns:d5p1="Microsoft.Search.Autopilot.NetMux">
<d5p1:IPPrefix>192.168.0.3/21</d5p1:IPPrefix>
</Address>
<AddressV6 xmlns:d5p1="Microsoft.Search.Autopilot.NetMux">
<d5p1:IPPrefix>::/0</d5p1:IPPrefix>
</AddressV6>
<ManagementAddress xmlns:d5p1="Microsoft.Search.Autopilot.NetMux">
<d5p1:IPPrefix>0.0.0.0/0</d5p1:IPPrefix>
</ManagementAddress>
<ManagementAddressV6 xmlns:d5p1="Microsoft.Search.Autopilot.NetMux">
<d5p1:IPPrefix>::/0</d5p1:IPPrefix>
</ManagementAddressV6>
<SerialNumber i:nil="true" />
<Hostname>svcstr-7050-acs-1-Servers0-SC</Hostname>
</Device>
<Device i:type="Server">
<ElementType>Server</ElementType>
<Address xmlns:d5p1="Microsoft.Search.Autopilot.NetMux">
<d5p1:IPPrefix>192.168.0.2/21</d5p1:IPPrefix>
</Address>
<AddressV6 xmlns:d5p1="Microsoft.Search.Autopilot.NetMux">
<d5p1:IPPrefix>fc02:1000::2/64</d5p1:IPPrefix>
</AddressV6>
<ManagementAddress xmlns:d5p1="Microsoft.Search.Autopilot.NetMux">
<d5p1:IPPrefix>0.0.0.0/0</d5p1:IPPrefix>
</ManagementAddress>
<Hostname>Servers0</Hostname>
</Device>
```
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
How I did it
get_mux_cable_entries tries to get the mux cable device from the devices list and reads the cable type and SoC IP address from the device definition.
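A hedged sketch of the idea, assuming the relevant devices have already been flattened into dictionaries with illustrative key names:
```python
def get_mux_cable_attrs(devices):
    """Illustrative only: pick cable_type from the SmartCable device and
    soc_ipv4 from the paired Server device."""
    attrs = {}
    for dev in devices:
        if dev.get('type') == 'SmartCable':
            attrs['cable_type'] = dev.get('subtype', '')  # e.g. 'active-active'
        elif dev.get('type') == 'Server':
            attrs['soc_ipv4'] = dev.get('address', '')    # e.g. '192.168.0.2/21'
    return attrs
```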
How to verify it
Pass the unit tests
Why I did it
Add the Celestica Belgite platform
How I did it
Add the Belgite platform under Celestica
Co-authored-by: nicwu-cel <nicwu@celestica.com>
Co-authored-by: anjian <anjian@celestica.com>
Co-authored-by: sandycelestica <sandyli@celestica.com>
This update has the following changes:
Refactor PCI topology logic for chassis (fixes some chassis commands and chassisd on linecards)
Introduce new cooling algorithm
Fix linecard poweroff logic when supervisor is going down
Fix linecard status LED handling leading to system-health crashing
Misc fixes
Why I did it
At present, there is no mechanism in an event-driven model to know that the system is up with all the essential SONiC services and that all the docker apps are ready, along with port-ready status, to start the network traffic. With the asynchronous architecture of SONiC, we cannot verify that the config has been applied all the way down to the hardware, but we can get the closest up status of each app and derive the overall system readiness.
How I did it
A new Python-based system monitor tool is introduced under the system-health framework to monitor all the essential system host services, including docker wrapper services, on an event-based model and to declare that the system is ready. This framework provides a way for docker apps to notify their closest up status. CLIs are provided to fetch the current system status, as well as each service's running status and app-ready status, along with the failure reason, if any.
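As one example of the kind of check such a monitor performs, a hedged sketch of querying a unit's state with systemctl (helper name hypothetical):
```python
import subprocess

def unit_is_active(unit):
    """Return True if systemd reports the unit as active."""
    result = subprocess.run(
        ['systemctl', 'show', unit, '--property=ActiveState', '--value'],
        capture_output=True, text=True)
    return result.stdout.strip() == 'active'
```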
How to verify it
"show system-health sysready-status" click CLI
Syslogs for system ready