Why I did it
pkgs.k8s.io: Introducing Kubernetes Community-Owned Package Repositories | Kubernetes
For 1.22.2 k8s packages, source repo has been deprecated, going to store these packages in sonic build storage for installation to mitigate the issue. Will migrate to new repo when we are ready to upgrade k8s version.
Work item tracking
Microsoft ADO (number only): 27075924
How I did it
Store the 1.22.2 k8s package in sonic build storage and install the package there.
How to verify it
"apt list" to check if it's installed.
Why I did it
deb11u1 is deprecated.
Use deb11u2 instead.
Other branches are not impacted, because their reproducible build version files are up to date.
Work item tracking
Microsoft ADO (number only): 26964185
How I did it
How to verify it
Why I did it
[202211, PR test] Use feature branch as the sonic-mgmt branch
Work item tracking
Microsoft ADO (number only):
How I did it
How to verify it
Signed-off-by: Nazarii Hnydyn nazariig@nvidia.com
A W/A to overcome SN2700 20 sec delay on login due to MFT bash autocompletion bug.
Must be reverted once a new MFT is ready.
Why I did it
To overcome SN2700 20 sec delay on login
Work item tracking
N/A
How I did it
Removed MFT bash autocompletion part
How to verify it
Build a mellanox image
Verify no such links after system boot.
To modify EEPROM API serial_number_str to return service tag instead of serial number in Dell S6100.
Ref PR: #1239
How I did it
Update EEPROM API serial_number_str to return service tag instead of serial number.
How to verify it
Verify decode-syseeprom -s returns service tag in Dell S6100.
#### Why I did it
src/sonic-linux-kernel
```
* 271f7ef - (HEAD -> 202211, origin/202211) arm64: Kconfig inclusions to fix PCI hang and MTD detection (#358) (28 hours ago) [Pavan Naregundi]
```
#### How I did it
#### How to verify it
#### Description for the changelog
#### Why I did it
src/sonic-utilities
```
* 29511a5a - (HEAD -> 202211, origin/202211) [dualtor] Add script to verify consistency between kernel and ASIC (#2840) (5 hours ago) [Longxiang Lyu]
```
#### How I did it
#### How to verify it
#### Description for the changelog
Fix#13561
The existing saidump use https://github.com/sonic-net/sonic-swss-common/blob/master/common/table_dump.lua script which loops the ASIC_DB more than 5 seconds and blocks other processes access.
This solution uses the Redis SAVE command to save the snapshot of DB each time and recover later, instead of looping through each entry in the table.
Related PRs:
sonic-net/sonic-utilities#2972sonic-net/sonic-sairedis#1288sonic-net/sonic-sairedis#1298
How did I do it?
To use the Redis-db SAVE option to save the snapshot of DB each time and recover later, instead of looping through each entry in the table and saving it.
1. Updated dockers/docker-base-bullseye/Dockerfile.j2, install Python library rdbtools into the all the docker-base-bullseye containers.
2. Updated sonic-buildimage/src/sonic-sairedis/saidump/saidump.cpp, add a new option -r, which updates the rdbtools's output-JSON files' format.
3. To add a new script file: syncd/scripts/saidump.sh into the sairedis repo. This shell script does the following steps:
For each ASIC, such as ASIC0,
3.1. Config Redis consistency directory.
redis-cli -h $hostname -p $port CONFIG SET dir $redis_dir > /dev/null
3.2. Save the Redis data.
redis-cli -h $hostname -p $port SAVE > /dev/null
3.3. Run rdb command to convert the dump files into JSON files
rdb --command json $redis_dir/dump.rdb | tee $redis_dir/dump.json > /dev/null
3.4. Run saidump -r to update the JSON files' format as same as the saidump before.
Then we can get the saidump's result in standard output."
saidump -r $redis_dir/dump.json -m 100
3.5. Clear the temporary files.
rm -f $redis_dir/dump.rdb
rm -f $redis_dir/dump.json
4. Update sonic-buildimage/src/sonic-utilities/scripts/generate_dump. To check the asic db size and if it is larger than ROUTE_TAB_LIMIT_DIRECT_ITERATION (with default value 24000) entries, then do with REDIS SAVE, otherwise, to do with old method: looping through each entry of Redis DB.
How to verify it
On T2 setup with more than 96K routes, execute CLI command -- generate_dump
No error should be shown
Download the generate_dump result and verify the saidump file after unpacking it.
Why I did it
A race condition exists while the TPH is processing a netlink message - if a second netlink message arrives during processing it will be missed since TPH is not listening for other messages.
Another bug was found where TPH was unnecessarily restarting since it was checking admin status instead of operational status of portchannels.
How I did it
Subscribe to APPL_DB for updates on LAG operational state
Track currently sniffed interfaces
How to verify it
Send tunnel packets with destination IP of an unresolved neighbor, verify that ping commands are run
Shut down a portchannel interface, verify that sniffer does not restart
Send tunnel packets, verify ping commands are still run
Bring up portchannel interface, verify that sniffer restarts
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Why I did it
This backports #16859, and #16896 to the 202211 branch, for fixing the subsequent build failures after the OpenSSH update to 1:8.4p1-5+deb11u2. This backport is being done in a single PR to avoid breakages in PR builds.
Work item tracking
Microsoft ADO (number only): 25517845
How I did it
How to verify it
### Why I did it
[Security] Upgrade the OpenSSL/OpenSSH to fix CVE alerts
Upgrade OpenSSL to 1.1.1n-0+deb11u5
Fix CVEs:
CVE-2023-0464 (Excessive Resource Usage Verifying X.509 Policy
CVE-2023-0465 (Invalid certificate policies in leaf certificates are
CVE-2023-0466 (Certificate policy check not enabled).
CVE-2022-4304 (Timing Oracle in RSA Decryption).
CVE-2023-2650 (Possible DoS translating ASN.1 object identifiers).
Upgrade OpenSSH to 8.4p1-5+deb11u2
Fix CVEs:
CVE-2023-38408 (Lacks SSH agent restriction)
##### Work item tracking
- Microsoft ADO **(number only)**: 25506776
#### How I did it
Upgrade the OpenSSL/OpenSSH package version and fix the UT failure.
#### How to verify it
Verified by UTs with and without FIPS enabled.
How I did it
Update Yang definition of ACL_TABLE_TYPE.
Update existing testcase.
Add new testcase to cover lowercase key scenario.
How to verify it
Verified by building sonic_yang_models-1.0-py3-none-any.whl. While building the target package, unit tests were run and passed.
- Why I did it
Because the Spectrum4 devices don't support mlxtrace utility.
- How I did it
Edit sai.profile and remove mlxtrace_spectrum4_itrace_*.cfg.ext files
Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
In #15080, there was a command added to re-add 127.0.0.1/8 to the lo
interface when the networking configuration is being brought down.
However, the trigger for that command is `down`, which, looking at
ifupdown2 configuration files, runs immediately after 127.0.0.1/16 is
removed. This means there may be a period of time where there are no
loopback addresses assigned to the lo interface, and redis commands will
fail.
Fix this by changing this to pre-down, which should run well before
127.0.0.1/16 is removed, and should always leave lo with a loopback
address.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: anamehra anamehra@cisco.com
Added a check for DEVICE_METADATA before accessing the data. This prevents the j2 failure when var is not available.
What I did:
Enable Sending BGP Community over internal neighbors over iBGP Session
Microsoft ADO: 25268695
Why I did:
Without this change BGP community send by e-BGP Peers are not carry-forward to other e-BGP peers.
str2-xxxx-lc1-2# show bgp ipv6 20c0:a801::/64
BGP routing table entry for 20c0:a801::/64, version 52141
Paths: (1 available, best #1, table default)
Not advertised to any peer
65000 65500
2603:10e2:400::6 from 2603:10e2:400::6 (3.3.3.6)
Origin IGP, localpref 100, valid, internal, best (First path received)
Last update: Tue Sep 26 16:08:26 2023
str2-xxxx-lc1-2# show ip bgp 192.168.35.128/25
BGP routing table entry for 192.168.35.128/25, version 52688
Paths: (1 available, best #1, table default)
Not advertised to any peer
65000 65502
3.3.3.6 from 3.3.3.6 (3.3.3.6)
Origin IGP, localpref 100, valid, internal, best (First path received)
Last update: Tue Sep 26 15:45:51 2023
After the change
str2-xxxx-lc2-2(config)# router bgp 65100
str2-xxxx-lc2-2(config-router)# address-family ipv4
str2-xxxx-lc2-2(config-router-af)# neighbor INTERNAL_PEER_V4 send-community
str2-xxxx-lc2-2(config-router-af)# exit
str2-xxxx-lc2-2(config-router)# address-family ipv6
str2-xxxx-lc2-2(config-router-af)# neighbor INTERNAL_PEER_V6 send-community
str2-xxxx-lc1-2# show bgp ipv6 20c0:a801::/64
BGP routing table entry for 20c0:a801::/64, version 52400
Paths: (1 available, best #1, table default)
Not advertised to any peer
65000 65500
2603:10e2:400::6 from 2603:10e2:400::6 (3.3.3.6)
Origin IGP, localpref 100, valid, internal, best (First path received)
**Community: 1111:1111**
Last update: Tue Sep 26 16:10:19 2023
str2-xxxx-lc1-2# show ip bgp 192.168.35.128/25
BGP routing table entry for 192.168.35.128/25, version 52947
Paths: (1 available, best #1, table default)
Not advertised to any peer
65000 65502
3.3.3.6 from 3.3.3.6 (3.3.3.6)
Origin IGP, localpref 100, valid, internal, best (First path received)
**Community: 1111:1111**
Last update: Tue Sep 26 16:10:09 2023
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Why I did it
Fix: #16699
Fast reboot is failing from old OS versions (eg., 201911 image) to latest (eg., master branch) after PR #15685
The system wide flag for FAST_REBOOT is still required when the base OS version does not support the new fast-reboot reconciliation logic (no db dump)
#### Why I did it
To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created #11129](https://github.com/sonic-net/sonic-buildimage/pull/11129).
There could be a scenario before the reboot, where
1. The `docker service` has stopped
2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry`
In such scenario, the `memory_checker` script will throw an error to the syslog:
```
ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))'
```
But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog.
#### How I did it
Change the log severity to the warning and changed the return value.
#### How to verify it
It is really hard to catch the exact moment described in the `Why I did it` section.
In order to check the logic:
1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](47742dfc2c/files/image_config/monit/memory_checker (L139)) file on the switch.
2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry`
3. Check the syslog for such messages:
```
WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte
d.', FileNotFoundError(2, 'No such file or directory'))'
INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running!
```
previously, get_num_asics() returns the maximum number of asics. however, the asic_count
should be actual number of asics populated which can be get from get_asic_presence_list().
ADO: 25158825
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Why I did it
Networking devices need to be responsive. Such responsiveness is harmed when the CPU change state.
There is a latency penalty when a CPU is idle (e.g C2) and need to exit this state to come back to C1 state.
To prevent this from happening the CPU should be forced to remain in C1 state.
How I did it
Generalize the cstate forcing to C1 to all Arista products.
This is done by adding processor.max_cstate=1 to the kernel cmdline for all CPUs.
Additionally Intel CPUs also need intel_idle.max_cstate=0 to fallback to the acpi_idle driver.
How to verify it
Check that processor.max_cstate=1 is present on the cmdline for AMD CPUs
Check that both processor.max_cstate=1 and intel_idle.max_cstate=0 are present on the cmdline for Intel CPUs
Openssh in Debian Bullseye has been updated to 1:8.4p1-5+deb11u2 to fix CVE-2023-38408.
Since we're building openssh with some patches, we need to update our version as well.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
#### Why I did it
src/sonic-linux-kernel
```
* 9534615 - (HEAD -> 202211, origin/202211) arm64: ac5: Fix watchdog timeleft (#334) (5 days ago) [pavannaregundi]
* 70c4df8 - [marvell-arm64]: Add support for 98DX35xx and 98CX85xx platform (#311) (6 days ago) [pavannaregundi]
* aab079e - [Mellanox] Upstream kernel patches with HW-MGMT 7.0030.1011 (#327) (4 weeks ago) [Kebo Liu]
```
#### How I did it
#### How to verify it
#### Description for the changelog
#### Why I did it
src/linkmgrd
```
* abb22d2 - (HEAD -> 202211, origin/202211) [warmboot] config all interfaces back to `auto` if reconciliation times out (#220) (7 days ago) [Jing Zhang]
```
#### How I did it
#### How to verify it
#### Description for the changelog
#### Why I did it
src/sonic-swss
```
* 9647b81f - (HEAD -> 202211, origin/202211) [muxorch] Reorder the neighbor disable operations (#2917) (12 hours ago) [Longxiang Lyu]
* 30cea968 - Support type7 encoded CAK key for macsec in config_db (#2892) (5 days ago) [judyjoseph]
* 8d76a4e7 - [202211][ppi]: General code cleanup: remove unused methods. (#2868) (3 weeks ago) [Nazarii Hnydyn]
```
#### How I did it
#### How to verify it
#### Description for the changelog
* [Mellanox] Update HW-MGMT package to new version V.7.0030.1010
Signed-off-by: Kebo Liu <kebol@mellanox.com>
* Update hw-mgmt version to 7.0030.1011
Signed-off-by: Kebo Liu <kebol@nvidia.com>
---------
Signed-off-by: Kebo Liu <kebol@mellanox.com>
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Why I did it
SONiC service determine-reboot-cause might run before driver creating reset cause files. In that case, the reset cause will be "Unknown". This PR introduces a wait mechanism to wait for reset cause sysfs files ready.
How I did it
/run/hw-management/config/reset_attr_ready is the file to indicate all reset cause files are ready. In chassis.get_reboot_cause function, it waits /run/hw-management/config/reset_attr_ready for up to 45 seconds.
How to verify it
Manual test on master/202211/202205
Why I did it
When SUPERVISOR_PROC_EXIT_LISTENER_SCRIPT changed, almost all dockers need to be built again.
But currently it will be loaded by cache.
Work item tracking
Microsoft ADO (number only): 25123348
How I did it
Add $(DOCKER)_FILES into dependencies.
How to verify it