- Why I did it
To include latest fixes and new functionality
SDK/FW
1. Fixed bug in recovery mechanism in case of I2C error when trying to access the XSFP module.
2. On the NVIDIA Spectrum-2 switch, when receiving a packet with Symbol Errors on ports that are configured in cut-through mode, a pipeline might get stuck.
3. On the Spectrum-2 and Spectrum-3 switch, if you enable ECN marking and the port is in split mode, traffic sent to the port under congestion (for example, when connecting two ports with a total speed of 50GbE to a single 25GbE port) is not marked.
4. Modifying an existing entry or adding a new one when the switch is at its maximum capacity (full by maximum allowed entries of any type such as routes, FDB, and so forth) will fail with an error.
5. When many ports are active (e.g., 70 ports up), and the configuration of shared buffer is applied on the fly, occasionally, the firmware might get stuck.
6. When a system has more than 256 ACL rules, on rare occasion, removing/adding rules may cause some ACL rules not to work.
7. On the SN2201 system, on an RJ45 port, the link might appear in 'down' state even if it operates properly.
8. Layer 4 port information is not initialized for BFD packet event. To address the issue, remote peer UDP port information was added in BFD packet event.
9. When setting LAG as a SPAN analyzer, the distributor mode of the LAG members was not taken into account. It may happen that the LAG member with distributor mode disabled will be set as a SPAN analyzer port.
- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.
- How to verify it
Build an image and run tests from "sonic-mgmt".
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
- Why I did it
To improve ASIC FW upgrade logging and have information about the cause of FW update failure in the log.
- How I did it
Added syslog logger support
In case the FW update has failed the update tool will give the cause of the failure in the output in the last line, starting with "Fail".
When running the tool, in case of a failed update, we will parse the output to retrieve the cause and log it.
Device #1:
----------
Device Type: ConnectX6DX
Part Number: MCX623106AN-CDA_Ax
Description: ConnectX-6 Dx EN adapter card; 100GbE; Dual-port QSFP56; PCIe 4.0/3.0 x16;
PSID: MT_0000000359
PCI Device Name: /dev/mst/mt4125_pciconf0
Base GUID: 0c42a103007d22d4
Base MAC: 0c42a17d22d4
Versions: Current Available
FW 22.32.0498 22.32.0498
PXE 3.6.0500 3.6.0500
UEFI 14.25.0015 14.25.0015
Status: Forced update required
---------
Found 1 device(s) requiring firmware update...
Device #1: Updating FW ...
FSMST_INITIALIZE - OK
Writing Boot image component - OK
Fail : The Digest in the signature is wrong
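The parsing step described above can be sketched in Python as follows. This is only an illustration of the idea, not the code of the upgrade script itself; the function names `extract_failure_cause` and `log_failure` and the syslog identifier are hypothetical.

```python
import syslog


def extract_failure_cause(output: str):
    """Return the last line of the tool output that starts with "Fail", or None."""
    cause = None
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Fail"):
            cause = stripped  # keep the most recent matching line
    return cause


def log_failure(output: str) -> None:
    """Write the failure cause, if any, to syslog.

    The identifier "mlnx-fw-upgrade" is an assumption made for this sketch.
    """
    cause = extract_failure_cause(output)
    if cause is not None:
        syslog.openlog("mlnx-fw-upgrade")
        syslog.syslog(syslog.LOG_ERR, cause)
        syslog.closelog()
```

With the sample output above, `extract_failure_cause` would return the line reporting the wrong signature digest.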
- How to verify it
mlnx-fw-upgrade.sh --upgrade
Add script usage and more information to script description being printed in help option.
- Why I did it
Missing information in script description in help option.
- How I did it
Expand script description and add script usage.
- How to verify it
Run the script with -h option.
Changing the default config knob value to be True for killing radv, due to the reasons below:
Killing RADV is to prevent sending the "cease to be advertising interface" protocol packet.
RFC 4861 specifies this cease packet with "SHOULD" rather than "MUST", so omitting it is not treated as fatal.
In an active-active scenario, the host side might have difficulty distinguishing whether the "cease to be advertising interface" packet is for the last interface leaving.
6.2.5. Ceasing To Be an Advertising Interface
shutting down the system.
In such cases, the router SHOULD transmit one or more (but not more
than MAX_FINAL_RTR_ADVERTISEMENTS) final multicast Router
Advertisements on the interface with a Router Lifetime field of zero.
In the case of a router becoming a host, the system SHOULD also
depart from the all-routers IP multicast group on all interfaces on
which the router supports IP multicast (whether or not they had been
advertising interfaces). In addition, the host MUST ensure that
subsequent Neighbor Advertisement messages sent from the interface
have the Router flag set to zero.
sign-off: Jing Zhang zhangjing@microsoft.com
#### Why I did it
Improve naming convention for bgp notification events and change type of leaf for sonic-events-host mem usage from uint64 to decimal64
#### How I did it
Replace "-" with "_"
Replace uint64 with decimal64
#### How to verify it
Run yang model unit tests
#### Description for the changelog
Change YANG model leaf naming convention for bgp notification
#### Why I did it
A segfault was occurring when running memory_checker
#### How I did it
Deinit publisher immediately after publishing
#### How to verify it
Manual testing
Why I did it
Update Nokia sonic-platform submodule
81a9c77 [Supervisor] Modifed the get_description to fix the name for Nokia-IXR7250E-SUP-10 card.
e49ddfb Fix the LedContorlCommon to get the physical index from port mapping
dd143f1 [module] modify the chassis.py and module.py to allow supervisor to retrieve the line card eemprom info
How I did it
Update Nokia sonic-platform submodule
81a9c77 [Supervisor] Modifed the get_description to fix the name for Nokia-IXR7250E-SUP-10 card.
e49ddfb Fix the LedContorlCommon to get the physical index from port mapping
dd143f1 [module] modify the chassis.py and module.py to allow supervisor to retrieve the line card eemprom info
How to verify it
On supervisor, "show chassis module status" should show Nokia-IXR7250E-SUP-10 instead of Nokia-IXR7250-SUP-10
Signed-off-by: mlok <marty.lok@nokia.com>
How I did it
radv sends a good-bye packet when the service is stopped, which causes an IPv6 route update on the SoC side. This update leads to an interface bounce and causes traffic disruption even though the ToR device might already be isolated.
This PR mitigates the traffic disruption during planned maintenance by killing radv instead of stopping it, so the cease packet is not sent.
How to verify it
Verified on dev clusters:
Traffic disruption was no longer reproducible.
radv took the killing path
if knob was off, radv would take the stopping path
sign-off: Jing Zhang zhangjing@microsoft.com
The critical process for database-chassis is redis-chassis, but critical_process is always hard-coded to the `redis` program.
Instead, use a jinja2 template to render the critical process list based on the database docker type: redis-chassis for the database-chassis docker and redis for the regular database docker.
Why I did it
This PR is an enhancement of PR #13105
Because an input string of AttachTo for an ACL table can appear in both the port name group and the port alias group, I added logic to determine whether the strings should be treated as port names or port aliases:
If all the input strings belong to the port name group, we treat all of them as port names.
If all the input strings belong to the port alias group, we treat all of them as port aliases.
If all the input strings belong to both the port alias group and the port name group, we prefer port alias. The behavior is as before.
How I did it
Walk through all port names/alias in the input to make a decision.
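A minimal sketch of the decision described above, assuming the port names and aliases are available as sets. The function name `resolve_attach_to_kind` is hypothetical, and returning `None` for a mixed input is an assumption, since the description does not say how that case is handled.

```python
def resolve_attach_to_kind(attach_to, port_names, port_aliases):
    """Decide whether AttachTo entries should be read as port names or aliases.

    Returns "alias", "name", or None when the entries are mixed or empty.
    The None result is an assumption made for this sketch.
    """
    entries = list(attach_to)
    if not entries:
        return None
    # Alias is checked first so that it wins when every entry is in both groups.
    if all(entry in port_aliases for entry in entries):
        return "alias"
    if all(entry in port_names for entry in entries):
        return "name"
    return None
```

Checking the alias group first is what keeps the previous behavior when an input is valid in both groups.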
How to verify it
Verified by adding UT.
a931d6c Prince George Wed Jan 18 19:10:55 2023 -0800 [Xcvrd]: Fix optics insertion/removal not detected (#333)
2211b7e mihirpat1 Wed Jan 18 16:00:22 2023 -0800 Xcvrd should restart if any child thread crashes (#326)
753b550 judyjoseph Tue Jan 17 13:10:09 2023 -0800 Chassisd do an explicit stop of the config_manager (#328)
879d630 Tal Berlowitz Fri Jan 6 01:57:42 2023 +0200 Fix bug where transceiver info is missing after port breakout change (#329)
e119b69 Junchao-Mellanox Tue Dec 13 19:54:49 2022 +0800 Remove TODO comments which are no longer needed (#325)
Signed-off-by: Mihir Patel <patelmi@microsoft.com>
Why I did it
Add explicit dependency on sonic_platform_common in sonic-chassisd mk. This was needed because sonic-chassisd depends on sonic-platform-base which is present in sonic-platform-common wheel package.
How I did it
Add explicit dependency on sonic_platform_common in sonic-chassisd mk.
How to verify it
Verified by building all platforms: broadcom, mellanox, marvel_arm
According to its manual page,
"[dget in its] first form, [..] fetches the requested URLs.
If this is a .dsc or .changes file, then dget acts as a source-package
aware form of wget: it also fetches any files referenced in the
.dsc/.changes file.
The downloaded source is then checked with dscverify and,
if successful, unpacked by dpkg-source."
Thus, when possible, using dget is preferable to wget so that source
authenticity checks can be performed automatically by dscverify.
Signed-off-by: Guillaume Lambert <guillaume.lambert@orange.com>
Why I did it
Fix the following issues for Seastone platform:
- system-health issue: show system-health detail will not complete #9530, Celestica Seastone DX010-C32: show system-health detail fails with 'Chassis' object has no attribute 'initizalize_system_led' #11322
- show platform firmware updates issue: Celestica Seastone DX010-C32: show platform firmware updates #11317
- other platform optimization
How I did it
Modify and optimize the platform implementation.
How to verify it
Manually run the test commands described in these issues.
Why I did it
When getting the system MAC on the Centec platform, the last byte of the MAC is increased by 1, but the carry case was not handled.
How I did it
First, remove the ":" separators so the MAC becomes a plain string.
Then convert the MAC from string to int, increase it by 1, and finally convert it back to a string, re-inserting the ":" separators.
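The approach above can be sketched as follows. The function name `increment_mac` is hypothetical, and wrapping around at 48 bits and emitting lowercase hex are assumptions of this sketch, not details taken from the change.

```python
def increment_mac(mac: str) -> str:
    """Return the MAC address that follows `mac`, propagating the carry."""
    # Drop the separators and read the 12 hex digits as one integer.
    value = int(mac.replace(":", ""), 16)
    # Add one; wrap at 48 bits so the result always fits (assumption).
    value = (value + 1) % (1 << 48)
    digits = f"{value:012x}"
    # Re-insert ":" between every pair of hex digits.
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))
```

Treating the whole address as one integer means a last byte of `ff` carries into the preceding byte instead of overflowing.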
Why I did it
If make fails, we can't rerun the make process, because existing patches can't apply again.
How I did it
Check whether the patches are already applied. If yes, don't apply them again.
How to verify it
Why I did it
The current PTF library contains a typo - when building a VxLAN packet, it uses the VxLAN module directly from the scapy library which will cause test failures.
How I did it
Patch simple_vxlan_packet to use the VxLAN module wrapped/defined in packet.py from the PTF library.
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Why I did it
There is a queue in sysmonitor.py that is created based on an object of multiprocessing.Manager.
After performing fast-reboot, the system health monitor is shut down, which causes this Manager to be shut down as well, since it is a child process of healthd.
That's why I moved the creation of this Manager from the top of the file to the function Sysmonitor.system_service() (the only place it is used), to make the Manager a child process of Sysmonitor instead of Healthd. This way both the queue (the Manager) and the processes that use this queue are child processes of the same process, and the problematic scenario of sysmonitor sending messages to a dead queue is no longer possible.
How I did it
Removed the definition of manager as global and moved it to system_service() function
How to verify it
Perform a fast reboot and verify the traceback issue is fixed
Why I did it
[Build] Support Debian snapshot mirror to improve build stability
This enhances reproducible builds by supporting the Debian snapshot mirror. It guarantees that all docker images use the same Debian mirror snapshot, and it fixes the intermittent build failures caused by remote Debian mirror indexes changing during the build. It also fixes the version conflict issue caused by some Debian packages not having fixed versions.
How I did it
Add a new feature to support the Debian snapshot mirror.
How to verify it
Why I did it
This PR is to update minigraph.py to support both port alias and port name as input of AttachTo attribute of ACL table.
Before this change, only port alias was supported.
How I did it
Add a global variable to store port names
Search both port names and port aliases when parsing the value of AttachTo.
How to verify it
Verified by a new unit test case test_minigraph_acl_attach_to_ports
Verified by copying the new minigraph.py to a testbed and running config load_minigraph.
Many of these switches have had their flash upgraded beyond 2GB; however, in
boot0 both were assigned 2GB for legacy reasons.
Remove the hardcoding of the flash size and let boot0 autodetect the available space.
Signed-off-by: Graham Hayes <gr@ham.ie>
Why I did it
[Seastone] Enhancement fix for PR12200 syseeprom issue.
How I did it
Enhance the fix by replacing the hardcoded devnum with a bash variable
How to verify it
show platform syseeprom or decode-syseeprom
Why I did it
Adapt the Ragile platforms ra-b6510-32c, ra-b6510-48v8c, ra-b6910-64c, and ra-b6920-4s to kernel 5.x
Signed-off-by: “pettershao” pettershao@ragilenetworks.com
#### Why I did it
Added SONiC YANG model for RADIUS.
Fixes https://github.com/sonic-net/sonic-buildimage/issues/12477
#### How I did it
Added the RADIUS and RADIUS_SERVER tables for global and per RADIUS server configuration. RADIUS statistics reside in COUNTERS_DB and are not part of the configuration. These are not a part of this PR.
#### How to verify it
Compiled sonic_yang_mgmt-1.0-py3-none-any.whl.
#### Description for the changelog
SONiC YANG model for RADIUS.
The main issue is the pip/pip3 command cannot be found when the package is being installed by apt-get.
When using the dpkg install, the searching path is PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
When using the apt-get install, the searching path is PATH=/usr/sbin:/usr/bin:/sbin:/bin
But the default pip/pip3 path is /usr/local/bin, so dpkg works, but apt-get does not.
How I did it
Export the path /usr/local/bin for pip/pip3.
This allows the deb packages to be installed by apt-get.
- Why I did it
In case of warm/fast reboot, the hardware reboot cause will NOT be cleared because the CPLD is not touched in this flow. To avoid confusing the reboot cause determination logic, the platform API skips the leftover hardware reboot cause and returns 'REBOOT_CAUSE_NON_HARDWARE' instead of the "hardware" reboot cause.
- How I did it
Check the proc cmdline to see whether the last reboot was a warm or fast reboot. If yes, skip checking the leftover hardware reboot cause.
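A minimal sketch of such a cmdline check is shown below. The marker strings in `WARM_FAST_MARKERS` and both function names are assumptions made for illustration; the description does not list the exact tokens the platform API looks for.

```python
# Kernel command-line tokens assumed to indicate a warm or fast reboot.
# These exact strings are an assumption of this sketch.
WARM_FAST_MARKERS = (
    "SONIC_BOOT_TYPE=warm",
    "SONIC_BOOT_TYPE=fast",
    "SONIC_BOOT_TYPE=fast-reboot",
)


def is_warm_or_fast_reboot(cmdline: str) -> bool:
    """Return True if the kernel command line carries a warm/fast reboot marker."""
    tokens = cmdline.split()
    return any(marker in tokens for marker in WARM_FAST_MARKERS)


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the kernel command line of the running system."""
    with open(path) as cmdline_file:
        return cmdline_file.read()
```

A caller would skip the leftover hardware reboot cause whenever `is_warm_or_fast_reboot(read_cmdline())` is true.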
- How to verify it
a. Manual test:
- Perform a power loss
- Perform a warm/fast reboot
- Check that the reboot cause is "warm-reboot" or "fast-reboot" instead of "power loss"
b. Run reboot cause related regression test.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Why I did it
docker-sonic-mgmt build is failing.
How I did it
The stretch docker was disabled recently. Update docker-sonic-mgmt to buster.
Migrate from sonictest to sonicbld, because Azure requires migrating the VM from uswest2 to uswest3.
Fix a build issue when building the image.
How to verify it
Update sonic-swss-common submodule pointer to include the following:
a4987b9 Change the dtor of ProducerStateTable to virtual method (#735)
7be565c [hash]: Add GH DB schema. (#733)
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
This is primarily to fix the armhf build failure due to deepdiff python
module getting updated.
1eb7a5b Pin deepdiff to version 6.2.2
ae09e3f [caclmgrd][dualtor] add iptables rule for dualtor gRPC to allow packets getting forwarded from loopback IP
00cb8cb [hostcfgd] Optimize the hostcfgs by moving the definition cmds into the loop to optimize the enable/disable service command run.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Why I did it
We plan to pilot the k8s feature and need to fix several bugs first, including enabling the telemetry feature and adding a platform label.
How I did it
Add feature set support; only the telemetry container upgrade is enabled for now
Add platform label for scheduler usage
Remove CNI installation code; it is installed automatically when kubeadm is installed
How to verify it
After the SONiC device joins the k8s cluster, show the node labels to check that the platform label is visible.
Signed-off-by: Yun Li yunli1@microsoft.com
Fix code issue when SonicV2Connector.get() returns None.
#### Why I did it
When a database key does not exist, SonicV2Connector.get() returns None.
The code breaks if the return value is not checked.
#### How I did it
Check the SonicV2Connector.get() return value before using it.
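The check can be illustrated with the sketch below. `FakeConnector` is a hypothetical stand-in that only mimics a `get()` call returning `None` for a missing key; it is not the real SonicV2Connector, and the helper name `get_field` and its arguments are assumptions for illustration.

```python
class FakeConnector:
    """Hypothetical stand-in for a DB connector whose get() may return None."""

    def __init__(self, data):
        self._data = data

    def get(self, db_name, key, field):
        # Missing entries yield None, mirroring the behavior described above.
        return self._data.get((db_name, key, field))


def get_field(db, db_name, key, field, default=""):
    """Read a field and fall back to `default` when the lookup returns None."""
    value = db.get(db_name, key, field)
    if value is None:
        return default
    return value
```

Calling string methods on the result is then safe because the helper never hands `None` back to the caller unless `None` is passed as the default.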
#### How to verify it
Pass all E2E test cases.
Why I did it
The docker-sonic-slave pipeline uses a different tag from the PR build.
As a result, ENABLE_DOCKER_BASE_PUll=y does not work.
How I did it
Set the reproducible build option in bash.
How to verify it
Why I did it
The Azure pipeline display is currently not specific. For example, when the step Run test fails, that step itself is shown as successful while the step Kvmdump is shown as failed, even though Kvmdump did not actually fail. This PR improves the display so that each step reports its own success or failure in the Azure pipeline.
How I did it
1. Each step has its own signature of success or failure.
2. Using the chain of responsibility pattern to manage all status.
3. Modify the expected-state in each step.
Without request, data has to be fetched from the BMC with curl, and every query starts a new curl process. pmon polls in a loop, so it repeatedly spawns processes, which is expensive. Using request avoids spawning a process.
The console of the centec-arm64 board is ttyAMA0. The current regular expression cannot parse it correctly.
Signed-off-by: centecqianj <qianj@centec.com>
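As an illustration of the console parsing problem mentioned above, the sketch below shows a pattern that accepts device names such as ttyAMA0 as well as ttyS0. The pattern, the function name `parse_console`, and the assumed `console=<device>,<baud>` input form are all hypothetical; the actual regular expression changed by this commit is not shown in the description.

```python
import re

# Matches e.g. "console=ttyS0,9600" or "console=ttyAMA0,115200n8".
# The device prefix may contain letters after "tty" (as in "ttyAMA").
CONSOLE_RE = re.compile(r"console=(tty[A-Za-z]*)(\d+)(?:,(\d+))?")


def parse_console(cmdline: str):
    """Return (device_prefix, port_index, baud) or None if no console is found."""
    match = CONSOLE_RE.search(cmdline)
    if match is None:
        return None
    prefix, index, baud = match.groups()
    return prefix, int(index), int(baud) if baud else None
```

A pattern that only allows a single fixed prefix such as ttyS would not match ttyAMA0, which is the kind of failure the commit message describes.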