Why I did it
This is to add support for specifying custom retry counts for LACP sessions. This is to make warmboot easier on low-storage and low-memory platforms, by allowing more than 90 seconds of downtime.
How I did it
How to verify it
Tested manually with these cases:
Verify that changing the retry count using teamdctl PortChannel101 state item set runner.retry_count 5 takes effect
Verify that the retry count change actually affects when the LAG goes down by forcefully killing teamd on one side (i.e. setting the retry count to 5 causes the LAG to go down after 150 seconds)
Verify that the retry count gets reset to 3 after the LAG goes down for whatever reason
Verify that the retry count gets reset to 3 after some period of time (30 seconds * retry count)
Test cases are in sonic-net/sonic-mgmt#7961 and sonic-net/sonic-mgmt#8152.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
#### Why I did it
When removing port from LAG while traffic is running thorough LAG there is traffic disruption of 60 seconds.
Fix issue https://github.com/sonic-net/sonic-buildimage/issues/14381
#### How I did it
The patch I added introduces "port_removing" op and call it right before Kernel is asked to remove the port.
Implement the op in LACP runner to disable the port which leads to proper LACPDU send.
#### How to verify it
Set LAG between 2 switches.
Set LAGs to be router port and set ip address.
In switch A send ping to ip address of LAG in switch B.
In switch B, while ping is running remove port from LAG.
Verify ping is not stopping.
#### Why I did it
Fix issue: wrong teamd link watch state after warm reboot due to TEAM_ATTR_PORT_CHANGED lost
The flag TEAM_ATTR_PORT_CHANGED is maintained by kernel team driver:
- a flag "changed" is maintained in struct team_port struct
- the flag is set by __team_port_change_send once relevant information is updated, including port linkup (together with speed, duplex), adding or removing
- the flag is cleared by team_nl_fill_one_port_get once the updated information has been notified to user space via RTNL
In the userspace, the change flag is maintained by libteam in struct team_port.
The team daemon calls port_priv_change_handler_func on receiving port change event.
The logic in port_priv_change_handler_func
1. creates the port if it did not exist, which triggers port add event and eventually calls lacp_port_added callback.
2. triggers port change event if team_port->changed is true, which eventually calls lw_ethtool_event_watch_port_changed to update port state for link watch ethtool.
3. removes the port if team_port->removed is removed
In lacp_port_added, it calls team_refresh to refresh ifinfo, port info, and option info from the kernel via RTNL.
In this step, port_priv_change_handler_func is called recursively.
- In the inner call, it won't get TEAM_ATTR_PORT_CHANGED flag because kernel has already notified that.
- As a result, team_port->changed flag is cleared in the libteam.
- The port change event won't be triggered from either inner or outer call of port_priv_change_handler_func.
If the port has been up when the port is being added to the team device, the "port up" information is carried in the outer call but will be lost.
In case the flag TEAM_ATTR_PORT_CHANGED is set only in the inner call, function port_priv_change_handler_func can be called in the inner call.
However, it will fail to fetch "enable" options because option_list_init has not be called.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
#### How I did it
Fix:
Do not call check_call_change_handlers when parsing RTNL function is called from another check_call_change_handlers recursively.
#### How to verify it
- Manually test
- Regression test
- warm reboot
- warm reboot sad lag
- warm reboot sad lag member
- warm reboot sad (partial)
* Ported Marvell armhf build on x86 for debian buster to use cross-compilation instead of qemu emulation
Current armhf Sonic build on amd64 host uses qemu emulation. Due to the
nature of the emulation it takes a very long time, about 22-24 hours to
complete the build. The change I did to reduce the building time by
porting Sonic armhf build on amd64 host for Marvell platform for debian
buster to use cross-compilation on arm64 host for armhf target. The
overall Sonic armhf building time using cross-compilation reduced to
about 6 hours.
Signed-off-by: marvell <marvell@cpss-build3.marvell.com>
* Fixed final Sonic image build with dockers inside
* Update Dockerfile.j2
Fixed qemu-user-static:x86_64-aarch64-5.0.0-2 .
* Update cross-build-arm-python-reqirements.sh
Added support for both armhf and arm64 cross-build platform using $PY_PLAT environment variable.
* Update Makefile
Added TARGET=<cross-target> for armhf/arm64 cross-compilation.
* Reviewer's @qiluo-msft requests done
Signed-off-by: marvell <marvell@cpss-build3.marvell.com>
* Added new radius/pam patch for arm64 support
* Update slave.mk
Added missing back tick.
* Added libgtest-dev: libgmock-dev: to the buster Dockerfile.j2. Fixed arm perl version to be generic
* Added missing armhf/arm64 entries in /etc/apt/sources.list
* fix libc-bin core dump issue from xumia:fix-libc-bin-install-issue commit
* Removed unnecessary 'apt-get update' from sonic-slave-buster/Dockerfile.j2
* Fixed saiarcot895 reviewer's requests
* Fixed README and replaced 'sed/awk' with patches
* Fixed ntp build to use openssl
* Unuse sonic-slave-buster/cross-build-arm-python-reqirements.sh script (put all prebuilt python packages cross-compilation/install inside Dockerfile.j2). Fixed src/snmpd/Makefile to use -j1 in all cases
* Clean armhf cross-compilation build fixes
* Ported cross-compilation armhf build to bullseye
* Additional change for bullseye
* Set CROSS_BUILD_ENVIRON default value n
* Removed python2 references
* Fixes after merge with the upstream
* Deleted unused sonic-slave-buster/cross-build-arm-python-reqirements.sh file
* Fixed 2 @saiarcot895 requests
* Fixed @saiarcot895 reviewer's requests
* Removed use of prebuilt python wheels
* Incorporated saiarcot895 CC/CXX and other simplification/generalization changes
Signed-off-by: marvell <marvell@cpss-build3.marvell.com>
* Fixed saiarcot895 reviewer's additional requests
* src/libyang/patch/debian-packaging-files.patch
* Removed --no-deps option when installing wheels. Removed unnecessary lazy_object_proxy arm python3 package instalation
Co-authored-by: marvell <marvell@cpss-build3.marvell.com>
Co-authored-by: marvell <marvell@cpss-build2.marvell.com>
with state of tdport from previous warm-reboot.
In case LAG was down before reboot, lacp->wr is not cleared.
In lacp_event_watch_port_flush_data we incremented nr_of_tdports and add
tdport to lacp->wr.state. In case lacp->wr.state already had this tdport
we do not set new state for tdport but appened a new item in
lacp->wr.state. In case we preformed warm-reboot and PortChannel member
was down, after reboot PortChannel member became up next warm-reboot
will initialize teamd with PortChannel member in down state.
Fix this issue by calling stop_wr_mode() when LAG was down. This was probably intended but missed.
#### Why I did it
To fix an issue seen in warm-reboot-sad test cases.
#### How I did it
I fixed it in SONiC libteam patch that adds warm-reboot support. Details in commit description.
#### How to verify it
Run warm-reboot-sad test on t0-56 topology.
#### Why I did it
Restrict the min-links parameter in "config portchannel" to the range 1-1024.
FixesAzure/sonic-buildimage#6781 in conjunction with https://github.com/Azure/sonic-buildimage/pull/1630.
Align YANG model with limits in libteam and sonic-utilties.
#### How I did it
PR 1630 in sonic-utilities prevents CLI user from entering a value outside the allowed range. This PR does the following:
- Increases the maximum value of min-links from 128 to 1024.
- Provides validation in libteam, incorporating as a patch the code in https://git.kernel.org/pub/scm/linux/kernel/git/jpirko/libteam.git/commit/?id=69a7494bb77dc10bb27076add07b380dbd778592.
- Updates the Yang model upper limit from 128 to 1024 (was inconsistent with libteam value).
- Updates the Yang model lower limit from 1 to 0, since 0 is set as default in sonic-utilities which would fail its new range check otherwise.
- Added Yang tests for valid and invalid value.
#### How to verify it
config portchannel add PortChannel0004 --min-links 1024
Command should be accepted.
show interfaces portchannel
Output should show PortChannel0004, no errors on CLI.
config portchannel add PortChannel0005 --min-links 1025
Command should be rejected
show interfaces portchannel
Output should not show PortChannel0005 , no errors on CLI.
#### Which release branch to backport (provide reason below if selected)
#### Description for the changelog
Updates YANG model to allow up to 1024 min_links for portchannel. FixesAzure/sonic-buildimage#6781 in conjunction with https://github.com/Azure/sonic-buildimage/pull/1630.
Fix#119
when parallel build is enable, multiple dpkg-buildpackage
instances are running at the same time. /var/lib/dpkg is shared
by all instances and the /var/lib/dpkg/updates could be corrupted
and cause the build failure.
the fix is to use overlay fs to mount separate /var/lib/dpkg
for each dpkg-buildpackage instance so that they are not affecting
each other.
Signed-off-by: Guohan Lu <lguohan@gmail.com>
- Add .gitignore files in each subdirectory of src/, so as to reduce the size of the .gitignore file in the project root, and also make it easier to maintain (i.e., if a directory in src/ is removed, there will not be outdated entries in the root .gitignore file.
- Also add missing .gitignore entries and remove outdated entries and duplicates.
- What I did
Ported a fix from libteam master to our master.
Fixes#4070Fixes#3649
- How I did it
Applied patch jpirko/libteam@c723737 from upstream.
- How to verify it
Build image for your DUT and warm-reboot your DUT 10 times. Check that all PortChannels are up and no error messages in teamd.log
In teamd v1.28, the port can only be enabled if sync bit is set
in recveived LACPDU from partner by the following commit
"teamd: lacp: update port state according to partner's sync bit"
(54f137c105)
However, lacp fallback feature needs to enable port even if partner
LACPDU is not received within a given period and fallback cfg is enabled.
To fix the lacp fallback breakage, we have to bypass the sync bit
check in lacp fallback mode.
Signed-off-by: Haiyang Zheng <haiyang.z@alibaba-inc.com>
The race condition could happen like this:
When an interface is enslaved into the port channel immediately after
it is created, the order of creating the ifinfo and linking the ifinfo to
the port is not guaranteed.
Please check the patch commit message to get full details.
Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
ARM Architecture support in SONIC
make configure platform=[ASIC_VENDOR_ARCH] PLATFORM_ARCH=[ARM_ARCH]
SONIC_ARCH: default amd64
armhf - arm32bit
arm64 - arm64bit
Signed-off-by: Antony Rheneus <arheneus@marvell.com>
* Added debug symbols to many debug dockers.
* For debug images *only*:
1) Archive source files into debug image
2) Archived source is copied into /src
3) Created an empty dir /debug
4) Mount both /src as ro & /debug as rw into every docker
5) Login banner will give some details on /src & /debug
6) Devs can copy core file into /debug and view it from inside a container.
7) Dev may create all gdb logs and other data directly into /debug.
* Dropped redundant REDIS_TOOLS per review comments.
* Added debug symbols to frr package and hence FRR based BGP docker.
* 1) Moved dbg_files.sh to scripts/
2) Src directories to archive are now collected from individual Makefiles.
3) Added few more debug symbols
4) Added few more debug dockers.
Here after no more changes except per review comments.
To debug:
Install required version of debug image in Switch or VM.
Copy core file into /debug of host
Get into Docker
gdb /usr/bin/<daemon> -c /debug/<your core file>
set directory /src/... <-- inside gdb to get the source
For non-in-depth debugging:
Download corresponding debug Docker image (docker-...-dbg.gz) to your VM
Load the image
Run image with entrypoint as 'bash' with dir containing core mapped in.
Run gdb on the core.
Port libteam patch which fixes the race condition we observed during
warm reboot.
Remove early patches: 0006, 0008, 0009.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Backport of
54f137c105
According to 6.4.15 of IEEE 802.1AX-2014, Figure 6-22, the state that the
port is selected moves MUX state from DETACHED to ATTACHED.
But ATTACHED state does not mean that the port can send and receive user
frames. COLLECTING_DISTRIBUTION state is the state that the port can send
and receive user frames. To move MUX state from ATTACHED to
COLLECTING_DISTRIBUTION, the partner state should be sync as well as the
port selected.
In function lacp_port_actor_update(), only INFO_STATE_SYNCHRONIZATION
should be set to the actor.state when the port is selected.
INFO_STATE_COLLECTING and INFO_STATE_DISTRIBUTING should be set to false
with ATTACHED mode and set to true when INFO_STATE_SYNCHRONIZATION of
partner.state is set.
In function lacp_port_should_be_{enabled, disabled}(), we also need to
check the INFO_STATE_SYNCHRONIZATION bit of partner.state.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
When adding a lag member dynamically after system boots up, teamd
port priv change handler could re-entrant itself and causing adding
operation to fail.
While handling PORT_CHANGE event, teamd_per_port.c port priv change
handler was called, it will then call runner_lacp to add port to lag,
the later causes IFINFO_CHANGE to be notified and calls the priv change
handler again, this re-entrance would cause runner_lacp port_added to
be called again and messes up with the previous adding sequence. Then
fails the lag member adding operation.
Prevent per port priv change handler re-entrance solves the problem.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Race condition has been noticed after warm reboot: sometimes when
port_changed notification was received, the link message didn't
have the device name. Without device name, creating team port
would fail.
Registering to the interface information change notification, so
later when device name becomes available, retry creating team port.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
The race condition could happen like this:
When an interface is enslaved into the port channel immediately after
it is created, the order of creating the ifinfo and linking the ifinfo to
the port is not guaranteed.
Please check the patch commit message to get full details.
Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
When using actor port number 0 in lag configuration, IO cannot be sent to
peer. Increase actor port number by 1 to keep uniqueness and at the same
time, avoid using actor port number 0.
Ref. 802.1AX 6.3.4 Port identification
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
- What I did
Fixed vanilla teamd bug, which prevented teamd to have a correct view of kernel state. Check bug #2 from the message
Changed schema for LACP port id.
Changed severity of an error message.
Removed logic to disable warm_start_read mode, when teamd started. It didn't work in system restart mode, because interfaces were added one by one, and it's impossible to say when everything is added.
- How I did it
I've added team_refresh() on every port addition
I extract port id from the port name. Currently I support only "EthernetX" scheme. We need to add more schemes if we change port scheme.
_err -> _info
...
- How to verify it
Build the image, install on your DUT, reboot it once, then reboot it on WR mode checking LACP state on remote side. The state shouldn't flip.
* Don't put down LAG interface when it starts in WR mode
* Change logic. Don't touch carrier in WR mode. Until it could be in UP mode
* Change control plane restore logic in WR mode
* Fix teamd behavior for Warm-reboot mode
* Don't save 'read' state into the struct. Try to read a lacp file everytime when a port starts.
* Fix filename for access()
* [libteam] Add fallback support for single-member-port LAG
* Allow the port to be selected if the LAG is configured
with fallback and port is in defaulted state due to missing
LACP PDUs from remote end
* Only enable port if LAG is admin up and the member port
is link up
* [team] Add lacp fallback config to teamd.j2 template
* [teamd] Resolve config conflict between fallback and minlink
* Remove min_link config if fallback is configured
* Add support for fallback config in minigraph
* [teamd] Only enable fallback if it is single-member-port LAG
Signed-off-by: Haiyang Zheng <haiyang.z@alibaba-inc.com>
* [teamd] Removing the admin status check in lacp_port_link_update
Will submit another pull request to fix this issue.
Signed-off-by: Haiyang Zheng <haiyang.z@alibaba-inc.com>
* [config]: Add SONIC_CONFIG_MAKE_JOBS
This config option allows user to specify -j value that will be passed
to each package build.
Signed-off-by: marian-pritsak <marianp@mellanox.com>
* Build improvements
Fix dependencies
Add configuration options
Automatically build sonic-slave
* Set default number of jobs to 1
* Auto generate target/debs directory
Signed-off-by: marian-pritsak <marianp@mellanox.com>
* Automatically remove sonic-slave container after exit
* Silence clean-logs
* Add SONIC_CLEAN_TARGETS to clean
* Use second expansion for clean dependencies
* Avoid creating empty log files
Remove log file on flush instead of writing empty string
* Put dpkg install inside lock
Use same lock as debian install targets do to avoid
race condition in dpkg installation
* Remove redirect to log from docker save
* Add .platform dependency to all and clean targets
* Remove header and footer from clean targets
* Disable messages for SONIC_CLEAN_TARGETS
* Exit with error if dpkg-buildpackage fails
* Set new location for debs in build_debian.sh
* Add recipe for docker-database
* Update redis version to 3.2.4
* Add support for p4 platform
* Add recipe for snmpd
* Add slave targets to phony and make all target default
* Remove build.sh from thrift
* Add versioning to team, nl, hiredis and initramfs
* Change sonic-slave to support snmpd build from sources
* Remove src/tenjin
* Add recipe for lldpd
* Add recipe for mpdecimal
* Remove hiredis directory on rebuild
* Add recipe for Mellanox hw management
* Remove generic image from all targets for Mellanox
* Add support for python wheels
* Add lldp and snmp dockers
* Sync docker-database to include libjemalloc
* Fix asyncsnmp variable name
* Change default build configuration
Redirect output to log files by default
Set number of jobs to nproc value
Do not print dependencies
Fix logging to print log of failed job into console
* Use docker inspect to check if sonic-slave image exists
* Use config in slave.mk directly
* Disable color output by default
* Remove sswsdk dependency from lldp and snmp dockers
* Fix comment in py wheels install targets
* Add dependency between two versions of sswsdk
* Add containers to mellanox platform
lldp, snmp and database containers
* Add recipe for team docker
* Add team docker to mellanox platform
* Encrypt password passed to build_debian.sh
* Update mellanox SAI version
Make version and revision setting only in main recipe
* Fix error handling in makefiles
As makefiles use .ONESHELL we should add -e
option to shell options in order to exit after any command fails
* Add recipe for platform monitor image
* Add platfotm monitor to mellanox targets
* Ignore submodules when building base image