Why I did it
There is an outstanding FRR issue #12380. This seems to be a known issue but without good fix so far. The root cause is around zebra and kernel netlink interaction. The failure was previously not noticed by zebra.
How I did it
Port the patch that would make the issue obvious.
Signed-off-by: Ying Xie ying.xie@microsoft.com
Why I did it
#4021 describes an issue that is still being observed on master image whereby sensord does not start in pmon due to missing service.
How I did it
Updated the lm-sensors install patch with a case for systemd
How to verify it
Verified that sensord is up in pmon after boot
Co-authored-by: Boyang Yu <byu@arista.com>
* Add smartmontools to pmon docker
* Set smartmontools to install version 7.2-1 in pmon to match host; clean up smartmontools build files
* Add comments on smartmontools version for both host and pmon
#### Why I did it
To add new SKU Mellanox-SN2700-D44C10 with following requirements:
| Port configuration | Value |
| ------ |--------- |
| Breakout mode for each port |**Defined in port mapping** |
| Speed of the port | **Defined in Port mapping** |
| Auto-negotiation enable/disable | **No setting required** |
| FEC mode | **No setting required** |
|Type of transceiver used | **Not needed**|
Buffer configuration | Value
------ |---------
Shared headroom | **Enabled**
Shared headroom pool factor | **2**
Dynamic Buffer | **Disable**
In static buffer scenario how many uplinks and downlinks? | **44 x50G and 2x100G Downlinks 8x100G uplinks**
2km cable support required? | **No**
Switch configuration | Value
------ |---------
Warmboot enabled? | **yes**
Should warmboot be added to SAI profile when enabled? | **yes**
Is VxLAN source port range set? | **No**
Should Vxlan source port range be added to SAI profile when set. | **No**
Is Static Policy Based Hashing enabled? | **No**
Port Mapping
| Ports | Mode |
| ------ |--------- |
| 1,2 | 1x100G |
| 3-6 | 2x50G |
| 7-10 | 1x100G |
| 11-22 | 2x50G |
| 23-26 | 1x100G |
| 27-32 | 2x50G |
Number of Uplinks / Downlinks:
TO topology: **44 x50G and 2x100G Downlinks 8x100G uplinks**.
#### How I did it
Defined the SKU as per requirements
#### How to verify it
Load the SKU and verify if all links come up and traffic passes.
changed the platform device name under nokia directory; we now need to specify marvell armhf/arm64 to provide more accurate platform identity. otherwise onie discovery won't recognize the asic being installed.
Why I did it
when we load images using onie discovery, the process was failing because of marvell ASIC mismatch
How I did it
replace the platform asic with marvell-armhf under 7215
How to verify it
load a new image using http server and verify that the image can be loaded successfully
Why I did it
Fix apt-get remove/purge version not locked issue when the apt-get options not specified.
How I did it
Add a space character before and after the command line parameters.
- Why I did it
Fixes#11431
- How I did it
dhcp6relay binds to ipv6 addresses configured on these vlan interfaces
Thus check if they are ready before launching dhcp6relay
- How to verify it
Unit Tests
Tested on a live device
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
Signed-off-by: Neetha John nejo@microsoft.com
Why I did it
slb and bgp mon peers are not needed for storage backend. These neighbor are present in the minigraph.
How I did it
After minigraph parsing, remove these neighbors if it is a storage backend device
How to verify it
Unit tests
Verified on the device that once these tables are removed, these peers don't show up in "show runningconfig bgp" output
co-authorized by: jianquanye@microsoft.com
Migrate the t0 and t1-lag test jobs in buildimage repo to TestbedV2.
Why I did it
Migrate the t0 and t1-lag test jobs in buildimage repo to TestbedV2.
How I did it
Migrate the t0 and t1-lag test jobs in buildimage repo to TestbedV2.
Remove ceos type setting
Use 202205 branch as sonic-mgmt branch
Why I did it
Change the path of sonic submodules that point to "Azure" to point to "sonic-net"
How I did it
Replace "Azure" with "sonic-net" on all relevant paths of sonic submodules
fix linecard provisioning issue (500 error)
fix some value types for get_system_eeprom_info API
refactor code to leverage pci topology (enabling dynamic Pcie plugin)
refactor asic declaration logic to new style
misc fixes
Why I did it
BGP service has always been starting after interface-config. However, recently we discovered an issue where some BGP sessions are unable to establish due to BGP daemon not able to read the interface IP.
This issue was clearly observed after upgrading to FRR 8.2.2. See more details in #12380.
How I did it
Delaying starting BGP seems to be a workaround for this issue.
However, caution is that this delay might impact warm reboot timing and other timing sequences.
This workaround is reducing the probability of hitting the issue by close to 100X. However, this workaround is not bulletproof as test shows. It is still preferrable to have a proper FRR fix and revert this change in the future.
How to verify it
Continuously issuing config reload and check BGP session status afterwards.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Why I did it
When sending a PR only CI change, as expected, the target target/python-wheels/buster/sonic_config_engine-1.0-py2-none-any.whl should be from the cache, because the depended files were not changed, but it rebuilt.
How I did it
Sort the files by name.
There's an odd crash that intermittently happens after the teamd container
exits, and a signal is raised to the main thread to exit. This thread (watching
teamd) continues execution because it's in a `while True`. The subsequent wait
call on the teamd container very likely returns immediately, and it calls
`is_warm_restart_enabled` and `is_fast_reboot_enabled`. In either of these
cases, sometimes, there is a crash in the transition from C code to Python code
(after the function gets executed). Python sees that this thread got a signal
to exit, because the main thread is exiting, and tells pthread to exit the
thread. However, during the stack unwinding, _something_ is telling the
unwinder to call `std::terminate`. The reason is unknown.
This then results in a python3 SIGABRT, and systemd then doesn't call the stop
script to actually stop the container (possibly because the main process exited
with a SIGABRT, so it's a hard crash). This means that the container doesn't
actually get stopped or restarted, resulting in an inconsistent state
afterwards.
The workaround appears to be that if we know the main thread needs to exit,
just return here, and don't continue execution. This at least tries to avoid it
from getting into the problematic code path. However, it's still feasible to
get a SIGABRT, depending on thread/process timings (i.e. teamd exits, signals
the main thread to exit, and then syncd exits, and syncd calls one of the two C
functions, potentially hitting the issue).
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
To get following fixes:
be7da6b [sonic-installer] use host docker startup arguments when running dockerd in chroot (#2179) (#2407)
d112f7c [202205][auto-ts] add memory check (#2116) (#2413)
This is being done for 201911 -> 202205 warmboot upgrade, and Mellanox
platforms need to be able to mount the squashfs file in the old image.
This reverts commit d5365928d4.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
- Why I did it
get_rx_los and get_tx_fault is not supported via the exisitng interface used, need provide dummy implementation for them.
NOTE: in later releases we will get them back via different interface.
- How I did it
Return False * lane_num for get_rx_los and get_tx_fault
- How to verify it
Added unit test