Commit Graph

5134 Commits

Author SHA1 Message Date
Stephen Sun
acac848858
[Reclaim buffer][202012] Reclaim unused buffers by applying zero buffer profiles (#9063)
- Why I did it
Support zero buffer profiles

1. Add buffer profiles and pool definition for zero buffer profiles
2. Support applying zero profiles on INACTIVE PORTS
3. Enable dynamic buffer manager to load zero pools and profiles from a JSON file

- How I did it
Add buffer profiles and pool definition for zero buffer profiles

If the buffer model is static:
 * Apply normal buffer profiles to admin-up ports
 * Apply zero buffer profiles to admin-down ports
If the buffer model is dynamic:
 * Apply normal buffer profiles to all ports
 * buffer manager will take care when a port is shut down

Update buffers_config.j2 to support INACTIVE PORTS by extending the existing macros to generate the various buffer objects, including PGs, queues, ingress/egress profile lists

Originally, all the macros to generate the above buffer objects took active ports only as an argument.
Now that buffer items need to be generated on inactive ports as well, an extra argument representing the inactive ports need to be added.
To be backward compatible, a new series of macros are introduced to take both active and inactive ports as arguments
The original version (with active ports only) will be checked first. If it is not defined, then the extended version will be called.
Only vendors who support zero profiles need to change their buffer templates
Enable buffer manager to load zero pools and profiles from a JSON file:

The JSON file is provided on a per-platform basis
It is copied from platform/<vendor> folder to /usr/share/sonic/temlates folder in compiling time and rendered when the swss container is being created.
To make code clean and reduce redundant code, extract common macros from buffer_defaults_t{0,1}.j2 of all SKUs to two common files:
One in Mellanox-SN2700-D48C8 for single ingress pool mode
The other in ACS-MSN2700 for double ingress pool mode
Those files of all other SKUs will be symbol link to the above files

Update sonic-cfggen test accordingly:
 * Adjust example output file of JSON template for unit test
 * Add unit test in for Mellanox's new buffer templates.

- How to verify it
Regression test.
Unit test in sonic-cfggen
Run regression test and manually test.

Signed-off-by: stephens <stephens@nvidia.com>
2021-12-09 17:34:56 +02:00
xumia
6a6512246d
[Bug][Build]: fix the file not found issue caused by the relative path (#9450)
#### Why I did it
Merged from master branch: https://github.com/Azure/sonic-buildimage/pull/9443
Fix the nodesource.list cannot read issue, it is cased by the full path not used.

```
2021-12-03T06:59:26.0019306Z Removing intermediate container 77cfe980cd36
2021-12-03T06:59:26.0020872Z  ---> 528fd40e60f6
2021-12-03T06:59:26.0021457Z Step 81/81 : RUN post_run_buildinfo
2021-12-03T06:59:26.0841136Z  ---> Running in d804bd7e1b06
2021-12-03T06:59:29.1626594Z DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
2021-12-03T06:59:34.2960105Z /usr/bin/sed: can't read nodesource.list: No such file or directory
2021-12-03T06:59:34.5094880Z The command '/bin/sh -c post_run_buildinfo' returned a non-zero code: 2
```
2021-12-07 23:42:33 -08:00
Jing Zhang
f93d1f64a2
[submodule]: Update submodule sonic-linkmgrd (#9421)
6c6151b Fix unstable unit tests (state change handler wasn't invoked) (#8)
2f7dc0a support code diff coverage (#5)
83f0002 Force mux state switch to standby if triggered from Cli (#6)

signed-off-by: Jing Zhang zhangjing@microsoft.com
2021-12-06 16:41:33 -08:00
Neetha John
8f15d39bae
[202012] [submodule] Update sonic-utilites pointer (#9410)
Contains the following commits
239cb5c  [flex counter] Flex counter threads consume too much CPU resources (Azure/sonic-utilities#1925)
8a3b41a [load_minigraph] Delay pfcwd start until the buffer templates are rendered  (Azure/sonic-utilities#1937)
2021-12-06 13:44:00 -08:00
kellyyeh
2019ccaa2a [radv] Run radv on MgmtToRRouter (#9424)
* Allow radv to run on mgmt tor and EPMS
2021-12-06 21:32:33 +00:00
xumia
5670f25416 [Build]: Fix tmpfs space not enough issue when building vs image (#9438)
Fix no space left on device issue in tmpfs.
2021-12-01T06:30:40.1651742Z cp: write error: No space left on device
2021-12-01T06:30:40.1652225Z Failure: local_fs_run():/dev/vdb Unable to copy /tmp/tmp.gl4Sgp/onie-installer.bin to tmpfs
2021-12-06 21:32:29 +00:00
xumia
a3f1769a34 [Build]: Cleanup the reproducible mirrors when build complete (#9132)
Why I did it
The reproducible build mirrors are only used during the build, the mirrors can be removed after that.
2021-12-06 21:32:24 +00:00
Volodymyr Samotiy
0831635b1c
[Mellanox] Update SDK to v4.4.3360 and FW to v2008.3358 (#9403)
- Why I did it
To include latest fixes.

1. On CMIS modules, after low power configuration, the firmware waited for the module state to be ModuleReady instead of ModuleLowPower causing delays.
2. When connecting Spectrum devices with optical transceivers that support RXLOS, remote side port down might cause the switch firmware to get stuck and cause unexpected switch behavior.
3. On rare occasions, when working with port rates of 1GbE or 10GbE and congestion occurs, packets may get stuck in the chip and may cause switch to hang.
4. When ECMP has high amount of next-hops based on VLAN interfaces, in some rare cases, packets will get a wrong VLAN tag and will be dropped.
5. Using SN4600C with copper or optics loopback cables in NRZ speeds, link may raise in long link up times ( up to 70 seconds).
6. When connecting SN4600C to SN4600C after Fastboot in 50GbE No_FEC mode with a copper cable, the link up time may take ~20 seconds.

- How I did it
Updated SDK submodule and relevant makefiles with the required versions.

- How to verify it
Build an image and run tests from "soni-mgmt".

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-12-06 11:01:43 +02:00
xumia
05351731d5
[Build]: Enable arm pr check #9362
Support marvell-armhf dpkg cache and the azp check.
Waiting for merging PR #9381 to 202012 branch, so only azp template change in this PR.

Move the VS build to a new stage BuildVS, change the Test stage only depending on BuildVS, running the BuildVS and the other platform's build in parallel. The Test stage do not has dependency on the marvel-armhf build, reduce the overall build time caused by longer build time of marvel-armhf build.
2021-12-04 09:31:53 +08:00
xumia
1e8fe7dd6c Fix armhf version issue (#9382)
Why I did it
Fix some of the version files not used issue.
One of example version file version-py3-all-armhf, when building marvell-armhf, the version is used as expected, but it not use.
2021-12-01 02:29:07 +00:00
Wirut Getbamrung
933454dc29 [device/celestica]: add controllable config to platform.json of e1031 (#9183) 2021-12-01 02:29:02 +00:00
Lawrence Lee
b3a3aa0c38 [mux]: Fix mark_dhcp_packet (#9373)
- Consolidate the two [Service] sections by moving the ExecStartPre line for mark_dhcp_packet.py to the first section and removing the second.
- Make the mark_dhcp_packet.py file executable
- Also clean up mark_dhcp_packet.py
    - Remove unused imports
    - Fix spacing and line lengths to conform to PEP8
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-12-01 02:28:56 +00:00
arlakshm
9f0fc89cff remove staticd.conf.j2 (#9182)
Why I did it
resolves #8979 and #9055

How I did it
Remove the file static.conf.j2,which adds the default route on eth0 from bgp docker

Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2021-12-01 02:28:51 +00:00
Stephen Sun
fafd5327bd [Reclaim buffer] Common infrastructure update for reclaiming buffer (#9133)
- Why I did it
This is to update the common sonic-buildimage infra for reclaiming buffer.

- How I did it
Render zero_profiles.j2 to zero_profiles.json for vendors that support reclaiming buffer
The zero profiles will be referenced in PR [Reclaim buffer] Reclaim unused buffers by applying zero buffer profiles #8768 on Mellanox platforms and there will be test cases to verify the behavior there.
Rendering is done here for passing azure pipeline.
Load zero_profiles.json when the dynamic buffer manager starts
Generate inactive port list to reclaim buffer

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-12-01 02:28:46 +00:00
Junchao-Mellanox
227f2f8aec [Mellanox] Fan speed should not be 100% when PSU is powered off (#9258)
- Why I did it
When PSU is powered off, the PSU is still on the switch and the air flow is still the same. In this case, it is not necessary to set FAN speed to 100%.

- How I did it
When PSU is powered of, don't treat it as absent.

- How to verify it
Adjust existing unit test case
Add new case in sonic-mgmt
2021-12-01 02:28:37 +00:00
xumia
d9fd39538b Support dpkg cache for marvell-armhf (#9381)
Why I did it
Support marvell-armhf dpkg cache
2021-11-30 13:11:12 +00:00
Junchao-Mellanox
b1162682cb
[system-health] [202012] No longer check critical process/service status via monit (#9367)
Backport https://github.com/Azure/sonic-buildimage/pull/9068 to 202012

#### Why I did it

Command `monit summary -B` can no longer display the status for each critical process, system-health should not depend on it and need find a way to monitor the status of critical processes. The PR is to address that. monit is still used by system-health to do file system check as well as customize check.

#### How I did it

1.	Get container names from FEATURE table
2.	For each container, collect critical process names from file critical_processes
3.	Use “docker exec -it <container_name> bash -c ‘supervisorctl status’” to get processes status inside container, parse the output and check if any critical processes exit

#### How to verify it

1. Add unit test case to cover it
2. Adjust sonic-mgmt cases to cover it
3. Manual test
2021-11-24 15:36:14 -08:00
Jing Zhang
3e6cdfa3a6
[sonic-linkmgrd] submodule update (#9343)
Submodule update for sonic-linkmgrd
Incorporates:

c11a576 (2021-11-22 09:38:46) [ci]: show code coverage in azure pipeline (#4)
4ceb01d (2021-11-18 20:24:20) Fix MUX toggling issue (#1)
d640527 (2021-11-12 22:31:44) [ci]: fix artifact download
b9f247d (2021-11-12 22:31:44) [ci]: use native arm64/armhf build
3059122 (2021-09-27 11:32:23) [linkgrd] Add Missing Apache License Header

signed-off-by: Jing Zhang zhangjing@microsoft.com
2021-11-24 11:12:22 -08:00
tjchadaga
d3a5c5ccd0
[202012][sonic-sairedis] update submodule (#9364)
Update sonic-sairedis submodule to get the below fixes:

7389704 [202012] Add ACL_TABLE object to break before make list (Azure/sonic-sairedis#971)
f334349 Fix hung issue when installing linux kernel modules (Azure/sonic-sairedis#969)
2021-11-24 11:10:50 -08:00
xumia
415fd17689 [Build]: Fix the version not found issue (#9331)
When we update the a sai package downing from a remote server, we need to update the version file as well currently, but the reproducible build feature is not enabled in master, it can only be detected when merging the code into the release branches, such as 202106, 202012, etc.
The reproducible feature is to reduce the build failure, not need to break the build when the version not specified. If version not specified, the best choice is to accept the version from remote server.

Co-authored-by: Ubuntu <xumia@xumia-vm1.jqzc3g5pdlluxln0vevsg3s20h.xx.internal.cloudapp.net>
2021-11-24 01:16:37 +00:00
shlomibitton
2361e75cfb
[hostcfgd] [202012] Fix the delay type to 'boot' delay instead of a unit activation delay (#8896)
#### Why I did it
With current code the delay will take place even if simple 'config reload' command executed and this is not desired.
This delay should be used only when fast-rebooting.

#### How I did it
Change the type of delay to OnBootSec instead of OnActiveSec.

#### How to verify it
Fast-reboot with this PR and observe the delay.
Run 'config-reload' command and observe no delay is running.
2021-11-23 15:21:07 -08:00
Vivek Reddy
edd6b847e9
[hostcfgd] [202012] Fixed the brief blackout in hostcfgd using SubscriberStateTable (#9228)
#### Why I did it
Backporting https://github.com/Azure/sonic-buildimage/pull/8861 to 202012
2021-11-22 21:57:07 -08:00
Qi Luo
7fb0f3f89f
[redis-py]: Fix redis version during pip3 install (#9329)
The recent release of redis 4.0.0 or newer (for python3) breaks sonic-config-engine unit test. Fix to last known good version.

ref: https://pypi.org/project/redis/#history
2021-11-22 11:06:12 -08:00
liuh-80
a5bf6fd874
[sonic-utilities] submodule update (#9342)
Submodule update for sonic-utilities with following change:

ec9e5ee Backport [generate_dump] remove secrets from dump files #1886 to 202012 (#1938)
ce3b856 [fdbshow]: Handle FDB cleanup gracefully. (#1926)
1437bf2 [202012] Add DHCPv6 Relay counter and ipv6 helper CLI (#1917)
2021-11-22 09:45:10 -08:00
Renuka Manavalan
678100a7c4
Cherry pick of PR #9123 (#9310)
[cherry-pick PR #9123 ]

Why I did it
When sshd realizes that this login can't succeed due to internal device state
or configuration, instead of failing right there, it proceeds to prompt for
password, so as the user does not get any clue on where is the failure point.

Yet to ensure that this login does not proceed, sshd replaces user provided password
with a specific pattern of characters matching length of user provided password.
This pattern is "<BS><LF><CR><DEL>INCORRECT", which is bound to fail.

If user provided length is smaller/equal, the substring of pattern is overwritten.
If user provided length is greater, the pattern is repeated until length is exhausted.

But if the PAM-tacacs plugin would send this password to AAA, the user could get
locked out by AAA, for providing incorrect value.

How I did it
Hence this fix, matches obtained password against the pattern. If match, fail just before
reaching AAA server.

How to verify it
Make sure tacacs is properly configured.
Try logging in as, say "user-A"; ensure it succeeds
Pick another user, say user-B and ensure this user has not logged into this device before (look into /etc/passed & folders under /home)
Disable monit service (as that could fix the issue using disk_check.py)
Start TCP dump for all TACACS servers.
Simulate Read-only disk
Try logging in using user-B.
Verify it fails, after 3 attempts
Stop tcp dump.
TCP dump should show "authentication" for user-A only
2021-11-19 17:33:13 -08:00
Shilong Liu
4533e64fb4 [CI] Fix Azure pipeline set -e not work. (#9282)
In azure pipeline template 'set -e' not works as expected.
2021-11-20 00:45:00 +00:00
tjchadaga
d5f95e5b2b
[sonic-utilites] submodule update (#9324) 2021-11-19 13:03:26 -08:00
Prince Sunny
54a29b16c5
[swss] Update submodule for sonic-swss (#9314)
c31a362 - 2021-11-18 : [202012][Mux orch] set default as standby, change mux orch priority (#2015) [Prince Sunny]
9a9e8e6 - 2021-11-18 : [202012] Check VS test failure (#2033) [Prince Sunny]
7eaabca - 2021-11-11 : [202012] Fix random failure in PR/CI build. (#2016) [Shilong Liu]
85230fe - 2021-11-04 : [orchagent] Fix group name of port-buffer-drop in flexcounterorch.cpp (#1967) [Junchao-Mellanox]
a55c2ca - 2021-11-03 : [teammgrd]: Handle LAGs cleanup gracefully on Warm/Fast reboot. (#1934) [Nazarii Hnydyn]
2021-11-18 21:08:16 -08:00
Prince Sunny
d6ab409709
[202012] td2/td3 change cpu cos num to 10 (#9311)
Cherry-pick from #9301
2021-11-18 12:48:20 -08:00
vdahiya12
a7a4980e45
[202012][sonic-platform-common] submodule update (#9297)
6f198d0 (HEAD -> 202012, origin/202012) [Y-Cable][Broadcom] upgrade to support Broadcom Y-Cable API to release (#230)
1c3e422 SSD Health: Retrieve SSD health and temperature values from generic SSD info (#229)

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
2021-11-17 23:06:44 -08:00
gechiang
a5f4780c64
[202012] BRCM SAI 4.3.5.1-8 Pick up fix for PFCWD getting continuously triggered/restored when pause frames are sent continuously to both queues of a port (#9296)
1.  CS00012211718 [4.3] Pfcwd getting continuously triggered/restored when pause frames are sent continuously to both queues of a port (TD2/Th/Th2/TD3) MSFT Default

Preliminary tests look fine. BGP neighbors were all up with proper routes programmed
interfaces are all up
Manually ran the following test cases on 7050CX3 (TD3) T0 DUT and all passed:
```
     fib/test_fib.py
     vxlan/test_vxlan_decap.py
     fdb/test_fdb.py
     decap/test_decap.py
     ipfwd/test_dip_sip.py 
     ipfwd/test_dir_bcast.py
     acl/test_acl.py
     vlan/test_vlan.py
     platform_tests/test_reboot.py
```
2021-11-17 21:30:10 -08:00
Qi Luo
c87ec48993
[sonic-utilities] submodule update (#9266)
9dd3025 2021-05-11 | [Command-Reference.md] Document new SNMP show and config commands (#1600) [Travis Van Duyn]
be40767 2021-05-05 | [show][config] Add new snmp commands (#1347) [Travis Van Duyn]
2021-11-15 21:06:43 -08:00
trzhang-msft
19008889de update DHCP_PACKET_MARK schema (#9077)
- update DHCP_PACKET_MARK schema in state_db
- this is an update over PR: Add service mark_dhcp_packet to mux container #9015
2021-11-15 21:37:08 +00:00
trzhang-msft
86fa5eede2 Add service mark_dhcp_packet to mux container (#9015)
- add a new service "mark_dhcp_packet" to mux container
- apply packet marks on a per-interface basis in ebtables
- write packet marks to "DHCP_PACKET_MARK" table in state_db
2021-11-15 21:36:29 +00:00
Renuka Manavalan
6cb7af73d9 add arista.log to logrotate (#9245) 2021-11-15 21:32:03 +00:00
kellyyeh
2cbe6a7502 DHCPv6 Relay multivlan functionality support (#9178)
Fix support for DHCPV6 Relay multi vlan functionality. Make sure the relayed packet is received at correct interface.

How I did it
Bind a socket to each vlan interface's global and link-local address.
Socket binded to global address is used for relaying data from client to server and receiving data from servers.
Socket binded to link-local address is used for relaying data received from server back to the client.
2021-11-15 21:31:58 +00:00
mssonicbld
36f1a547b1
[ci/build]: Upgrade SONiC package versions (#9255) 2021-11-14 23:26:35 +00:00
mssonicbld
4d15a1c1f6
[ci/build]: Upgrade SONiC package versions (#9221) 2021-11-13 23:37:09 +00:00
gechiang
7ac5b40f4b
[202012]BRCM SAI 4.3.5.1-7 Picked up fixes for CS00012209390, CS00012212995, SONIC-51583, CS00012215744, and SONIC-51638 (#9252)
This is to pick up BRCM SAI 4.3.5.1-7 fixes which contains the following fixes:

1.  CS00012209390: SONIC-50037, Used SAI_SWITCH_ATTR_QOS_DSCP_TO_TC_MAP as a default decap map for IPinIP tunnels.
2.  CS00012212995: SONIC-50948 SAI_API_QUEUE:_brcm_sai_cosq_stat_get:1353 egress Min limit get failed with error Invalid parameter 
3.  SONIC-51583: Fixed acl group member creation failure with priority of -1
4.  CS00012215744:SONIC-51395 [TH, TH2] WB 3.5 to 4.3 fails at APPLY_VIEW while setting SAI_PORT_ATTR_EGRESS_ACL
5.  SONIC-51638: SDK-249337 ERROR: AddressSanitizer: heap-buffer-overflow in _tlv_print_array

Preliminary tests look fine. BGP neighbors were all up with proper routes programmed
interfaces are all up
Manually ran the following test cases on 7050CX3 (TD3) T0 DUT and all passed:
```
     fib/test_fib.py
     vxlan/test_vxlan_decap.py
     fdb/test_fdb.py
     decap/test_decap.py
     ipfwd/test_dip_sip.py 
     ipfwd/test_dir_bcast.py
     acl/test_acl.py
     vlan/test_vlan.py
     platform_tests/test_reboot.py
```
2021-11-13 10:45:46 -08:00
Mykhailo Onipko
a7117b905f
[BFN]: Updated SDK packages to 20211112 (#9244)
Signed-off-by: Mykhailo Onipko <monipko@barefootnetworks.com>
2021-11-12 21:47:56 -08:00
Qi Luo
2a7595169b
[sonic-swss-common] Update submodule (#9225)
ead0d5a 2021-11-10 | Exclude *.a files from python deb packages (#554) [Qi Luo]
3a660ac 2021-10-20 | Fix the option missing in kernel config issue (#541) [xumia]
2021-11-11 00:50:17 -08:00
Lawrence Lee
b027e87ffb [mux.service]: Remove pmon dependency (#9211)
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-11-11 02:56:27 +00:00
Lawrence Lee
f317d93cb0 Merged PR 4679112: [write_standby]: Ignore non-auto interfaces
[write_standby]: Ignore non-auto interfaces

* In the event that `write_standby.py` is used to automatically switchover interfaces when linkmgrd or bgp crashes, ignore any interfaces that are not configured to auto-switch

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-11-10 18:54:33 -08:00
Lawrence Lee
57ad50cfd9 Merged PR 4559560: [bgp]: Switch to standby if BGP container exits
[bgp]: Switch mux to standby if BGP container exits

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-11-10 18:54:33 -08:00
Lawrence Lee
6a9c709336 [write_standby]: Improve logging
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-11-10 18:54:33 -08:00
Lawrence Lee
77378b4364 [mux]: Call write_standby from host only
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-11-10 18:54:33 -08:00
Lawrence Lee
25712c712e [mux]: Make write_standby available on host
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>

[write_standby]: Cleanup and fix build

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-11-10 18:54:33 -08:00
Lawrence Lee
84cd0e9471 [mux]: Initialize all mux ports as standby
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-11-10 18:54:33 -08:00
Tamer Ahmed
18d1f65339 Merged PR 4813977: [mux] Update Service Install With SONiC Target
[mux] Update Service Install With SONiC Target

Recent PR grouped all SONiC service into sonic.taget. The install section
of mux.service was not update and this causes delays when using config
reload as the service failed state is not being reset.

signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
2021-11-10 18:54:33 -08:00
Lawrence Lee
70fbd6826c Merged PR 4366316: [mux.service]: Bind to sonic.target
[mux.service]: Bind to sonic.target

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-11-10 18:54:33 -08:00