Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>
Backport from master
Appropriate PR on master: #7735
Appropriate PR on master: #6444
Why I did it
PG drop counters should be enabled by default (merge from master)
After "config reload" or "docker swss restart", all counters were re-enabled even if they had been disabled before
How I did it
1) Add a PG drop counter enable option to dockers/docker-orchagent/enable_counters.py
2) Check whether an entry already exists before setting default values
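A minimal sketch of step 2, assuming the flex counter configuration is modeled as a plain dictionary keyed by counter group. The function name, the group list, and the in-memory table are hypothetical stand-ins, not the actual code in enable_counters.py:

```python
# Hypothetical in-memory stand-in for the flex counter table in CONFIG_DB.
DEFAULT_ENABLED_GROUPS = ["PORT", "QUEUE", "PG_DROP"]


def set_default_counter_status(table, groups=DEFAULT_ENABLED_GROUPS):
    """Enable each counter group only when it has no entry yet.

    Existing entries (for example a group the user disabled) are left
    untouched, so a config reload does not re-enable them.
    """
    for group in groups:
        if group in table:
            continue  # saved configuration already decided this group
        table[group] = {"FLEX_COUNTER_STATUS": "enable"}
    return table


saved = {"PG_DROP": {"FLEX_COUNTER_STATUS": "disable"}}
result = set_default_counter_status(saved)
print(result["PG_DROP"]["FLEX_COUNTER_STATUS"])  # previously saved value is kept
print(result["PORT"]["FLEX_COUNTER_STATUS"])     # missing entry gets the default
```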
How to verify it
- Install the image and run the counterpoll show CLI command; PG_STAT_DROP should be shown as enabled
- Disable a few counters
counterpoll pg-drop disable
counterpoll port disable
- Save and reload
config save
config reload
- Check enable status
* [ci] Set default ACR in UpgrateVersion/PR/official pipeline. (#10341)
Why I did it
Docker Hub limits the pull rate.
Use ACR instead to pull debian related docker image.
How I did it
Set DEFAULT_CONTAINER_REGISTRY in pipeline.
* Add a config variable to override default container registry instead of dockerhub. (#10166)
* Add variable to reset default docker registry
* fix bug in docker version control
c3d9d8f2bcd364dc81cd4d9bec02666cef648b10 (HEAD -> 201911, origin/201911) API for getting all members from all VLANs (#106)
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
What I did:
Added support to create the route-map action set tag
when the allow prefix list matches. The tag can be defined by the user in
constants.yml.
Why I did:
For the Allow List feature, the base route-map calls the allow-list route-map. Having a set tag option lets the base route-map match on the tag and take further action if needed. The tag provides metadata that the base route-map can use.
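As an illustration only, the effect of the optional tag on the generated route-map can be sketched as below. The function and its arguments are hypothetical, and the real change lives in the Jinja templates and constants.yml rather than in Python:

```python
def render_allow_list_route_map(name, seq, prefix_list, tag=None):
    """Return FRR-style config lines for one allow-list route-map entry.

    `tag` models the optional user-defined value; when it is None,
    no `set tag` line is emitted.
    """
    lines = [
        f"route-map {name} permit {seq}",
        f" match ip address prefix-list {prefix_list}",
    ]
    if tag is not None:
        lines.append(f" set tag {tag}")
    return lines


# Names and the tag value below are made up for the example.
for line in render_allow_list_route_map("ALLOW_LIST_V4", 10, "PL_ALLOW_V4", tag=1001):
    print(line)
```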
[201911][pfcwd] Avoid ingress drop by not attaching zero profiles when pfc storm is detected (#2279)
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
* fix allow list issue
Signed-off-by: stormliang <stormliang@microsoft.com>
* add the ipaddress in the install list
* add unit test
Co-authored-by: Ubuntu <azureuser@SONIC-SH-STORM-02.5pu3m0fajw1edcfltykk1gauxa.gx.internal.cloudapp.net>
Why I did it
Removing part of the BGP allowed prefix list configuration failed. See #10141 for details.
How I did it
There are two issues:
1. In FRR the IPv6 default route is ::/0, but in the configuration it is 0::/0, so the string comparison fails. Why, then, did removal from the allowed prefix list fail for IPv4 but work for IPv6? The second issue answers that.
2. The current managers_allow_list doesn't support removing part of the prefix list. IPv6 removal worked only because of the default route comparison bug in 1: the update ran regardless of the operation. (The code compares the prefix list in FRR with the one in the configuration DB; if every entry in the DB is present in FRR it does nothing, otherwise it updates the prefix list from the DB.)
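Both issues can be illustrated with Python's standard ipaddress module: parsing the prefixes makes 0::/0 and ::/0 compare equal, and a set difference finds entries that must be removed as well as added. This is a sketch with hypothetical helper names, not the code in managers_allow_list, and it ignores any extra prefix-list qualifiers:

```python
import ipaddress


def normalize(prefixes):
    """Parse prefix strings so that '0::/0' and '::/0' compare equal."""
    return {ipaddress.ip_network(p) for p in prefixes}


def prefix_list_diff(configured, running):
    """Return (to_add, to_remove) between the configured and running sets."""
    want, have = normalize(configured), normalize(running)
    return want - have, have - want


# Same default route written two ways: nothing to add or remove.
print(prefix_list_diff(["0::/0"], ["::/0"]))

# One prefix was deleted from the configuration: it shows up in to_remove.
print(prefix_list_diff(["10.0.0.0/8"], ["10.0.0.0/8", "192.168.0.0/16"]))
```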
How to verify it
Follow the step in #10141
f91a9e6e07a43cae531cda019935de3221e0bb09 (HEAD -> 201911, origin/201911) Fix: not to use blocking get_all() after keys() (#255)
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
- Why I did it
To include the fix for an issue where modifying shared headroom on the fly can lead to negative occupancy, which causes the switch to send PFC continuously.
- How I did it
Updated submodule pointer and version in relevant Makefile.
- How to verify it
Build an image and run tests from sonic-mgmt.
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
Why I did it
Porting changes from DellEMC: S6100 CPLD upgrade #4299 and DellEMC S6100 CPLD upgrade support #3834 to 201911 branch
Added CPLD upgrade support for DellEMC S6100 platform.
9ce4d19d5a199cffe2933d80e343a80ded398b4a (HEAD -> 201911, origin/201911) With the changes in PR: https://github.com/Azure/sonic-buildimage/pull/5289, access to the redis unix socket is given to the redis group members. Many sonic-util commands (especially in the multi-asic case) use the redis unix socket to connect to the DB, so those commands fail without sudo. This PR is a continuation of PR: https://github.com/Azure/sonic-buildimage/pull/7002, where we default to TCP for Redis if the user is not root
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
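The default described in the Redis access change above can be sketched as a simple check on the effective user ID. The function name and the returned labels are hypothetical; the actual connector code is not shown in this log:

```python
import os


def choose_redis_transport(euid=None):
    """Pick 'unix_socket' for root and 'tcp' for any other user."""
    if euid is None:
        euid = os.geteuid()  # effective user ID of the calling process
    return "unix_socket" if euid == 0 else "tcp"


print(choose_redis_transport())
```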
1c12a4050fecabd88245c7aa64a61259bc00db3b (HEAD -> 201911, origin/201911) Allowing the first time FEC and AN configuration to be pushed to SAI (#1705) (#2196)
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
What I did:
Updated the Jinja template to enable BGP Graceful Restart based on device role. By default it is enabled only if the device role type is TorRouter.
Why I did:-
By default FRR is configured in Graceful Restart helper mode. Graceful Restart is needed only on T0/TorRouter, since the device can go through warm-reboot. T1/LeafRouter needs to be in helper mode only.
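The role-based decision amounts to the following, shown here as a hypothetical Python helper rather than the Jinja template that the change actually modifies:

```python
def bgp_graceful_restart_mode(device_role):
    """Return 'restart' for ToR devices and 'helper' for everything else."""
    return "restart" if device_role == "TorRouter" else "helper"


for role in ("TorRouter", "LeafRouter"):
    print(role, bgp_graceful_restart_mode(role))
```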
* [warm boot finalizer] only wait for enabled components to reconcile
Define the component with its associated service. Only wait for components that have associated service enabled to reconcile during warm reboot.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
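A sketch of the component filtering described in the warm boot finalizer change above. The component-to-service mapping and the set of enabled services are made-up inputs for illustration; the real finalizer determines them on the device:

```python
def components_to_wait_for(component_services, enabled_services):
    """Return the components whose associated service is enabled."""
    return [
        component
        for component, service in component_services.items()
        if service in enabled_services
    ]


# Hypothetical mapping of reconciling components to their services.
mapping = {"orchagent": "swss", "neighsyncd": "swss", "bgp": "bgp", "natsyncd": "nat"}
print(components_to_wait_for(mapping, enabled_services={"swss", "bgp"}))
```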
When FECDisabled is set to true in minigraph.py, push 'fec' 'none' explicitly to config_db. When 'fec' is defined in port_config.ini do not override it with 'rs' for 100G
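The FEC selection can be sketched as below. The function is hypothetical, and the precedence of FECDisabled over a port_config.ini value is an assumption, since the description does not state how the two interact:

```python
def resolve_fec(port_config_fec, speed, fec_disabled):
    """Return the 'fec' value to write to config_db, or None to omit it."""
    if fec_disabled:
        return "none"            # pushed explicitly (assumed to take precedence)
    if port_config_fec is not None:
        return port_config_fec   # value from port_config.ini is not overridden
    if speed == 100000:
        return "rs"              # default for 100G ports
    return None


print(resolve_fec(None, 100000, fec_disabled=True))
print(resolve_fec("fc", 100000, fec_disabled=False))
print(resolve_fec(None, 100000, fec_disabled=False))
```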
Backport of #7667 to 202012 branch.
What I did:
Backport FRR patch FRRouting/frr#8220 on FRR 7.2. Fixes the Issue FRRouting/frr#8213
Why I did:-
Because of this race-condition we saw GR getting triggered even though BGP shut is given on peer device.
How I verify:
After patching this fix GR is not triggered on doing BGP shut on peer.
Co-authored-by: Zhi Yuan (Carl) Zhao <zyzhao@arista.com>
Why I did it
The Arista 7060 platform has a rare and unreproducible PCIe timeout that might be resolved by increasing the switch PCIe timeout value. To do this we'll call a script for this platform to increase the PCIe timeout on boot-up.
No issues would be expected from the setpci command. From the PCIe spec:
"Software is permitted to change the value in this field at any
time. For Requests already pending when the Completion
Timeout Value is changed, hardware is permitted to use either
the new or the old value for the outstanding Requests, and is
permitted to base the start time for each Request either on when
this value was changed or on when each request was issued. "
How I did it
Add "platform-init" support in swss docker similar to how "hwsku-init" is called, only this would be for any device belonging to a platform. Then the script would reside in device data folder.
Additionally, add pciutils dependency to docker-orchagent so it can run the setpci commands.
How to verify it
After boot-up of an Arista 7060, execute:
lspci -vv -s 01:00.0 | grep -i "devctl2"
to check that the timeout has changed.
Identify the bad password set by sshd and fail auth before sending to
AAA server, and hence avoid possible user lock out by AAA.
For more details, please refer to the parent/original PR #9123
* [radv] Support multiple ipv6 prefixes per vlan interface (#9934)
* Radvd.conf.j2 template creates two copies of the vlan interface when there is more than one IPv6 address assigned to a single vlan interface. Changed the format to add prefixes under the same vlan interface block.
Resolves #8979 and #9055
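The radvd change above amounts to grouping prefixes by interface before rendering. A rough Python sketch follows; the helper names are hypothetical, the options inside each block are illustrative, and the real logic is in the Jinja template:

```python
from collections import defaultdict


def group_prefixes_by_interface(vlan_addresses):
    """Group (interface, ipv6_prefix) pairs into one entry per interface."""
    grouped = defaultdict(list)
    for interface, prefix in vlan_addresses:
        grouped[interface].append(prefix)
    return dict(grouped)


def render_radvd(vlan_addresses):
    """Render one radvd-style interface block holding all of its prefixes."""
    blocks = []
    for interface, prefixes in group_prefixes_by_interface(vlan_addresses).items():
        lines = [f"interface {interface}", "{", "    AdvSendAdvert on;"]
        for prefix in prefixes:
            lines += [f"    prefix {prefix}", "    {", "        AdvOnLink on;", "    };"]
        lines.append("};")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


text = render_radvd([
    ("Vlan1000", "2001:db8:1::/64"),
    ("Vlan1000", "2001:db8:2::/64"),
])
print(text)
```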
How I did it
Remove the file static.conf.j2, which adds the default route on eth0, from the frr docker
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
a1830c1761087bdc1f7433ebbb8d0bdc419da0d3 (HEAD -> 201911, origin/master, origin/HEAD, origin/201911) Fix OpenAPI spec to be readable by autorest (#101)
94805a39ac0712219f7dc08faa2cfdbf371dd177 Identify and report Vnet GUID for conflicting VNI (#99)
4832dfd677de72edc44d4eb8c1b60cfad79a3355 Static route expiry if not specified as persistent (#98)
5cc4358fb67b9e2a0da9a6691064e41f97ebebc2 (master) Add support for overlay ECMP (#96)
6822a46197daef060b4d00dba5153b04b163c43f [CI] Set diff cover threshold to 50% (#97)
dcc826a1503060b9a07e4510b4f48331c49e87dd Add PR diff coverage (#95)
e842c5ff317c67919dcbcab3358143cb9a16c9dd Generate code coverage for Unit Tests (#94)
f9bbed3cb86a3bab9a07745096835dbdbe5a4db6 Convert Unit Tests from unittest framework to pytest framework (#93)
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
[201911] Prevent other notification event storms to keep enqueue
unchecked and drained all memory that leads to crashing the switch
router (#981)
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Why I did it
Fix some unreliability seen on emmc device with some AMD CPUs
How I did it
Added a kernel parameter to add quirks to
It depends on a sonic-linux-kernel change to work properly but will be a no-op without it.
Description for the changelog
Add emmc quirks for Upperlake
- Why I did it
Optimize thermal control policies to simplify the logic and add more protection code in policies to make sure it works even if kernel algorithm does not work.
- How I did it
Reduce unused thermal policies
Add timely ASIC temperature check in thermal policy to make sure ASIC temperature and fan speed is coordinated
Minimum allowed fan speed now is calculated by max of the expected fan speed among all policies
Move some logic from fan.py to thermal.py to make it more readable
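The minimum fan speed rule from the list above can be sketched as follows. The policy names and speed values are invented for the example and do not come from the platform code:

```python
def minimum_allowed_fan_speed(expected_speeds):
    """Return the highest expected fan speed (percent) across all policies."""
    return max(expected_speeds.values())


# Hypothetical per-policy expectations, in percent.
expected = {"ambient_policy": 60, "asic_temperature_policy": 80}
print(minimum_allowed_fan_speed(expected))
```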
- How to verify it
1. Manual test
2. Regression
Why I did it
To incorporate the below changes in DellEMC S6100, S6000 platforms.
Enable thermalctld
Backport Platform API changes from master branch.
How I did it
Remove 'skip_thermalctld:true' in pmon_daemon_control.json
Implement the platform API methods in the respective device files
How to verify it
Verified that platform data is displayed by show platform fan and show platform temperature commands.
Why I did it
Cannot retrieve and display the reboot-cause.
How I did it
Correct the platform initialization definition.
How to verify it
Manual reboot and then 'show reboot-cause'
Backport #9258 to 201911
Why I did it
When PSU is powered off, the PSU is still on the switch and the air flow is still the same. In this case, it is not necessary to set FAN speed to 100%.
How I did it
When a PSU is powered off, don't treat it as absent.
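A sketch of the distinction, assuming each PSU is described by hypothetical "present" and "power_good" fields. Only physical absence forces full fan speed; a powered-off PSU that is still inserted does not:

```python
def any_psu_absent(psus):
    """Powered-off PSUs that are still inserted do not count as absent."""
    return any(not psu["present"] for psu in psus)


def target_fan_speed(psus, normal_speed):
    """Return 100 when a PSU is physically missing, else the normal speed."""
    return 100 if any_psu_absent(psus) else normal_speed


powered_off = [{"present": True, "power_good": False}, {"present": True, "power_good": True}]
removed = [{"present": False, "power_good": False}, {"present": True, "power_good": True}]
print(target_fan_speed(powered_off, normal_speed=60))
print(target_fan_speed(removed, normal_speed=60))
```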
How to verify it
Adjust existing unit test case
Add new case in sonic-mgmt
Conflicts:
platform/mellanox/mlnx-platform-api/sonic_platform/thermal_infos.py
- Why I did it
To include latest fixes.
1. On CMIS modules, after low power configuration, the firmware waited for the module state to be ModuleReady instead of ModuleLowPower causing delays.
2. When connecting Spectrum devices with optical transceivers that support RXLOS, remote side port down might cause the switch firmware to get stuck and cause unexpected switch behavior.
3. On rare occasions, when working with port rates of 1GbE or 10GbE and congestion occurs, packets may get stuck in the chip and may cause switch to hang.
4. When ECMP has high amount of next-hops based on VLAN interfaces, in some rare cases, packets will get a wrong VLAN tag and will be dropped.
5. When using SN4600C with copper or optical loopback cables at NRZ speeds, link-up may take a long time (up to 70 seconds).
6. When connecting SN4600C to SN4600C after Fastboot in 50GbE No_FEC mode with a copper cable, the link up time may take ~20 seconds.
- How I did it
Updated SDK submodule and relevant makefiles with the required versions.
- How to verify it
Build an image and run tests from "sonic-mgmt".
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
3ce811960f19c514a6ca0b1c611b2c453eb3a0a3 (HEAD -> 201911, origin/201911) [201911][port2alias]: Fix to get right number of return values (#1907)
e648290b51fa4ec4d465efe55aa4d27d16edb249 disk_Check: Scan & mount as RW when disk turns into Read-only (#1872)
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Commits on Oct 26, 2021
Remove exec from platform_reboot call to prevent reboot hang (#1881) 066b5adf6d737a5bd174123d4d00dab4b6110cf6
Commits on Nov 17, 2021
[fdbshow]: Handle FDB cleanup gracefully. (#1918) c80321c98d0741f340d2900108bad7fed76c80cd