Why I did it:
Fix multiple Seastone platform issues caused by the SONiC kernel upgrade.
How I did it:
Get the GPIO base ID using the new label path in the GPIO sysfs (a sketch follows below).
How to verify it:
After the change, 'show platform fan', 'show platform psustatus' and 'show platform temperature' work as expected.
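A minimal sketch of the label-based lookup described above; the label string is only an illustration, the actual platform code matches its own GPIO expander label:
# Find the gpiochip whose sysfs label matches the platform's GPIO expander
# and read its base number; "pca9505" is a placeholder label for illustration.
for chip in /sys/class/gpio/gpiochip*; do
    if grep -q "pca9505" "$chip/label" 2>/dev/null; then
        cat "$chip/base"
        break
    fi
done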
The only platforms that currently need the stretch slave container are
Innovium and Nephos, and neither builds with the current code due to
other issues. All other platforms only need the buster and bullseye slave
containers.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Why I did it
The smartctl tool is available only in the PMON docker, so it may not be accessible if the PMON docker goes down.
Use the iSMART_64 tool to fetch the SSD firmware version and device model information.
How I did it
Replacing smartctl with iSMART_64.
1d53bf4 Skip platform NDK health check two times in watchdog.sh
d68297c Added code to shut down the channel after the gRPC call; also fixed the show fp-status command
0769efe Implemented the module API to return the correct eeprom info for fabric card
171569c Remove explicit logger identifier for transceiver module operations; use inherited id
6c4d651 Corrected the log messages for firmware install
Signed-off-by: mlok <marty.lok@nokia.com>
Why I did it
Recently, the t0-sonic job has been running stably on the 202205 branch, so in this PR I set it to mandatory in the Azure pipeline.
How I did it
Modify the value of continueOnError in this job from `true` to `false`.
Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Why I did it
Previously, we set a timeout in each step, such as Lock testbed, Prepare testbed, Run test and KVM dump. When an issue such as a retry happens in one step, it causes a timeout error even though the step only needs more time to succeed. In this PR, we remove the timeout limit on each step and control the timeout at the job level instead. When a job runs for more than four hours, it will be cancelled.
How I did it
Remove the timeout parameter from each step and control the timeout at the job level.
How to verify it
Set the timeout of one job to 4 hours; when the timeout is reached, the Azure pipeline cancels the job.
Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Why I did it
Enable the sai-ptf logger in sai_adapter to log all SAI API invocations.
How I did it
Add a build parameter to enable the sai-ptf logger when building the SAI RPC.
How to verify it
local build test
test the generated sai_adapter
test with pipeline
Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
* Fix the long duration issue when switching from kube mode to local mode
* Remove IPv6 parameters that are not necessary
* Fix a bug in reading node labels
* Tag the running image to latest if it's stable
* Disable image_version_higher check
* Change image_version_higher checker test case
Signed-off-by: Yun Li <yunli1@microsoft.com>
Co-authored-by: lixiaoyuner <35456895+lixiaoyuner@users.noreply.github.com>
Why I did it
Publish docker saiserverv2 in the build pipeline.
How I did it
Cherry-pick #12842 from master to 202205 branch.
How to verify it
Run the test in #12843; the docker image was built successfully.
Signed-off-by: Neetha John <nejo@microsoft.com>
Why I did it
ECN parameters need to be updated for storage backend
How I did it
Included a check for storage backend devices to update the QoS configs
How to verify it
Verified that the new ECN settings are applied on storage backend devices.
Verified that the old ECN settings are applied on storage frontend and non-storage frontend/backend devices.
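One possible way to spot-check the applied values on a device (assuming the default AZURE_LOSSLESS profile name):
# Dump the WRED/ECN profile from CONFIG_DB and compare the thresholds against
# the expected storage-backend vs. frontend values.
redis-cli -n 4 HGETALL "WRED_PROFILE|AZURE_LOSSLESS"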
Why I did it
Added to allow test_crm_route to pass; the test tries to add a /126 IPv6 route, and this change is required for the count of available routes to be updated correctly.
Why I did it
There is an issue on the Arista PikeZ platform (using T3.X2: BCM56274) while running SONiC. If the 'syncd' container in SONiC is restarted, the expected behaviour is that syncd will automatically restart/recover; however it does not and always fails at create_switch due to BCM SDK kmod DMA operation cancellation getting stuck.
Sep 16 22:19:44.855125 pkz208 ERR syncd#syncd: [none] SAI_API_SWITCH:platform_process_command:428 Platform command "init soc" failed, rc = -1.
Sep 16 22:19:44.855206 pkz208 INFO syncd#supervisord: syncd CMIC_CMC0_PKTDMA_CH4_DESC_COUNT_REQ:0x33#015
Sep 16 22:19:44.855264 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:platformInit:1909 initialization command "init soc" failed, rc = -1 (Internal error).
Sep 16 22:19:44.855403 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:sai_driver_init:642 Error initializing driver, rc = -1.
...
Sep 16 22:19:44.855891 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:brcm_sai_create_switch:1173 initializing SDK failed with error Operation failed (0xfffffff5).
Reloading the BCM SDK kmods allows the switch init to continue properly.
How I did it
If the BCM SDK kmods are already loaded, unload and reload them in the syncd docker start script.
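A rough sketch of the reload step added to the start script; the module names here are illustrative, the exact kmod set depends on the BCM SDK build:
# If the SDK kmods are still loaded from a previous syncd run, unload them so
# the stuck DMA state is cleared, then load them again before switch init.
if lsmod | grep -q linux_kernel_bde; then
    rmmod linux_bcm_knet linux_user_bde linux_kernel_bde
fi
modprobe linux_kernel_bde
modprobe linux_user_bde
modprobe linux_bcm_knet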
How to verify it
Steps to reproduce:
In SONiC, run 'docker ps' to see current running containers; 'syncd' should be present.
Run 'docker stop syncd'
Wait ~1 minute.
Run 'docker ps' to see that syncd is missing.
Check logs to see messages similar to the above.
Signed-off-by: Michael Li <michael.li@broadcom.com>
Why I did it
Previously, we hard-coded the minimum and maximum numbers of instances in a plan. In this PR, we support passing the instance numbers of a test plan.
How I did it
Use a variable to set the instance number.
swss:
* 7e274a4 2022-11-18 | [Fdbsyncd] Bug Fix for remote MAC move to local MAC and Fix for Static MAC advertisement in EVPN. (#2521) (HEAD -> 202205, github/202205) [KISHORE KUNAL]
* 434e80c 2022-11-02 | Fix vs test issue: failed to remove vlan due to referenced by vlan interface (#2504) [Stephen Sun]
* 11bef87 2022-11-27 | [dual-tor] add missing SAI attribute in order to create IPNIP tunnel (#2503) [Andriy Yurkiv]
* 11aba29 2022-11-09 | [SWSS] Innovium platform specific changes in PFC Detect lua script (#2493) [maulik_patel_marvell]
* 4a165ee 2022-11-14 | Revert "[vlanmgr] Disable `arp_evict_nocarrier` for vlan host intf (#2469)" (#2518) [Longxiang Lyu]
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Why I did it
The current error handling code for when a deb package fails to be
installed is a chain of commands linked together by && that
ends with exit 1. The assumption is that the commands would succeed,
and the last exit 1 would end it with a non-zero return code, thus
fully failing the target and causing the build to stop because of bash's
-e flag.
However, if one of the commands prior to exit 1 returns a non-zero
return code, then bash won't actually treat it as a terminating error.
From bash's man page:
-e Exit immediately if a pipeline (which may consist of a single simple
command), a list, or a compound command (see SHELL GRAMMAR above),
exits with a non-zero status. The shell does not exit if the
command that fails is part of the command list immediately
following a while or until keyword, part of the test following the
if or elif reserved words, part of any command executed in a && or
|| list except the command following the final && or ||, any
command in a pipeline but the last, or if the command's return
value is being inverted with !. If a compound command other than a
subshell returns a non-zero status because a command failed while
-e was being ignored, the shell does not exit.
The phrase "part of any command executed in a && or || list except the
command following the final && or ||" says that if the failing command
is not the exit 1 that we have at the end, then bash doesn't treat it
as an error and exit immediately. Additionally, since this is a compound
command but isn't in a subshell (subshells are marked by ( and ),
whereas { and } just tell bash to run the commands in the current
environment), bash doesn't exit. The result of this is that in the
deb-install target, if a package installation fails, it may be
infinitely stuck in that while-loop.
This was seen when the snmpd package upgrade happened: builds were
failing to install the mismatching libsnmp-dev package, but they did not
immediately terminate; instead, the installation was retried again and
again, suggesting the build was stuck in an infinite loop. The build jobs
finally terminated only because of the timeout specified for the jobs.
How I did it
There are two fixes for this: change to using a subshell, or use ;
instead of &&. Using a subshell would, I think, require exporting any
shell variables used in the subshell, so I chose to change the && to
;. In addition, set +e is added at the start of the command group,
which removes bash's exit-on-error handling. This makes sure that
all commands are run (the output of which may help for debugging) and
that the group still exits with 1, which will then fully fail the target.
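A simplified before/after sketch of this change; the command names are placeholders, not the actual deb-install recipe:
# Before: the diagnostics are chained with '&&'. If one of them fails, 'exit 1'
# is never reached, and because the failure happens while -e is being ignored
# inside the '&&' list (and { } is not a subshell), bash does not exit either,
# so the retry loop keeps spinning.
while true; do
    dpkg -i "$pkg" && break || {
        dump_dpkg_status && collect_build_logs && exit 1;
    }
done

# After: use ';' and 'set +e' inside the group so every diagnostic command runs,
# then fail the target explicitly with 'exit 1'.
while true; do
    dpkg -i "$pkg" && break || {
        set +e;
        dump_dpkg_status;
        collect_build_logs;
        exit 1;
    }
done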
How to verify it
Why I did it
Some sonic-mgmt platform_tests/api were failing on the 7060CX-32S
How I did it
Added the missing metadata in platform.json and platform_components.json
This is purely test data and does not impact our API implementation.
How to verify it
Run platform_tests/api and expect a 100% pass rate.
Why I did it
Some sonic-mgmt platform_tests/api were failing on the 7260CX3-64
How I did it
Added the missing metadata in platform.json and platform_components.json
This is purely test data and does not impact our API implementation.
How to verify it
Run platform_tests/api and expect a 100% pass rate.
Why I did it
The PR applies a separate DSCP_TO_TC_MAP and TC_TO_QUEUE_MAP to the uplink ports on dualtor.
The traffic with DSCP 2 and DSCP 6 from T1 is treated as lossless traffic.
DSCP TC Queue
2 2 2
6 6 6
Traffic with DSCP 2 or DSCP 6 from downlink is still treated as lossy traffic as before.
How I did it
Define DSCP_TO_TC_MAP|AZURE_UPLINK and TC_TO_QUEUE_MAP|AZURE_UPLINK.
How to verify it
Verified by UT
Verified by copying the new template to a testbed and rendering a config_db.json
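For example, after rendering the new template on a dualtor testbed, the uplink maps can be inspected roughly like this (the key names follow the PR; the commands are just one way to check):
# Inspect the rendered file and the running CONFIG_DB entries for the uplink maps.
sonic-cfggen -j config_db.json --var-json "DSCP_TO_TC_MAP"
redis-cli -n 4 HGETALL "DSCP_TO_TC_MAP|AZURE_UPLINK"
redis-cli -n 4 HGETALL "TC_TO_QUEUE_MAP|AZURE_UPLINK"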
Signed-off-by: Neetha John <nejo@microsoft.com>
Why I did it
There is a need to have separate profiles on compute and storage, and this infra update will help achieve that.
How I did it
Moved the buffer pool/profile and QoS definitions for TD2 to a common folder; all TD2 HWSKUs will reference that folder.
Why I did it
Move armhf syncd docker compilation to bullseye.
How I did it
Compile the syncd docker for the armhf platform using the commands below:
NOJESSIE=1 NOSTRETCH=1 NOBUSTER=1 BLDENV=bullseye make configure PLATFORM=marvell-armhf PLATFORM_ARCH=armhf
NOJESSIE=1 NOSTRETCH=1 NOBUSTER=1 BLDENV=bullseye make target/docker-syncd-mrvl.gz
How to verify it
Upgrade the syncd docker and verify the ports are up.
Signed-off-by: rajkumar38 <rpennadamram@marvell.com>
Why I did it
TX FIR tuning should be done based on the type of inserted transceiver
How I did it
Add media_settings.json, which contains the tuning data for the 100G and 400G optics.
How to verify it
Tested against x86_64-arista_7800r3a_36d2_lc
Why I did it
A new SKU for MSN4700 Platform i.e. Mellanox-SN4700-V16A96
Requirements:
Breakout:
Port 1-24: 4x25G(4)[10G,1G]
Port 25-28: 2x100G[200G,50G,40G,25G,10G,1G]
Port 29-32: 2x200G[100G,50G,40G,25G,10G,1G]
Downlinks: 96 (1-24) + 4 (25-28)
Uplinks: 4 (29-32)
Shared Headroom: Enabled
Over Subscribe Ratio: 1:4
Default Topology: T0
Default Cable Length for T1: 5m
VxLAN source port range set: No
Static Policy Based Hashing Supported: No
Additional Details:
QoS params: The default ones defined in qos_config.j2 will be applied
Small Packet Percentage: 50% is used for the traditional buffer model. Note: for the dynamic model, the value defined in LOSSLESS_TRAFFIC_PATTERN|AZURE|small_packet_percentage is used
The SKU was drafted under the assumption that the downlink ports use transceivers that only support the first 4 lanes of the physical port they are connected to. Hence, for ports 1-24, the last four lanes are not used
Cable Lengths used for generating buffer_defaults_{t0,t1}.j2 values
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
A new SKU for MSN4700 Platform i.e. Mellanox-SN4700-V48C32
Requirements:
Breakout:
Port 1-24: 2x200G
Port 25-32: 4x100G
Downlinks: 48 (1-24)
Uplinks: 32 (25-32)
Shared Headroom: Enabled
Over Subscribe Ratio: 1:8
Default Topology: T1
Default Cable Length for T1: 300m
VxLAN source port range set: No
Static Policy Based Hashing Supported: No
Additional Details:
QoS params: The default ones defined in qos_config.j2 will be applied
Small Packet Percentage: 50% is used for the traditional buffer model. Note: for the dynamic model, the value defined in LOSSLESS_TRAFFIC_PATTERN|AZURE|small_packet_percentage is used
Cable Lengths used for generating buffer_defaults_{t0,t1}.j2 values
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
Why I did it
The current lazy installer relies on a filename sort for both the unpack and configuration steps. When systemd services are configured [started] by multiple packages, the order is by filename, not by the declared package dependencies. This can cause the start order of services to differ between first boot and subsequent boots. Declared systemd service dependencies further exacerbate the issue (e.g. blocking the first-boot script).
The current installer leaves packages unconfigured if the package dependency order does not match the filename order.
This also fixes a trivial bug in "[Build]: Support to use symbol links for lazy installation targets to reduce the image size" (#10923), where externally downloaded dependencies are duplicated across lazy package device directories.
How I did it
Changed the staging and first-boot scripts to use apt-get:
dpkg -i /host/image-$SONIC_VERSION/platform/$platform/*.deb
becomes
apt-get -y install /host/image-$SONIC_VERSION/platform/$platform/*.deb
when dependencies are detected during image staging.
How to verify it
Apt-get critical rules
Add a Depends= to the control information of a package. Grep the syslog for rc.local between images and observe the configuration order of packages change.
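A rough way to run the syslog check above; the exact log text depends on how rc.local forwards the installer output:
# Compare the order of "Setting up <package>" lines logged via rc.local on the
# first boot of the old and new images.
grep rc.local /var/log/syslog | grep -i "setting up"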