Commit Graph

7613 Commits

Author SHA1 Message Date
Saikrishna Arcot
f84dfd2345
Re-add 127.0.0.1/8 when bringing down the interfaces (#15080)
* Re-add 127.0.0.1/8 when bringing down the interfaces

With #5353, 127.0.0.1/16 was added to the lo interface, and then
127.0.0.1/8 was removed. However, when bringing down the lo interface,
like during a config reload, 127.0.0.1/16 gets removed, but 127.0.0.1/8
isn't added back to the interface. This means that there's a period of
time where 127.0.0.1 is not available at all, and services that need to
connect to 127.0.01 (such as for redis DB) will fail.

To fix this, when going down, add 127.0.0.1/8. Add this address before
the existing configuration gets removed, so that 127.0.0.1 is available
at all times.

Note that running `ifdown lo` doesn't actually bring down the loopback
interface; the interface always stays "physically" up.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-06-13 18:45:39 -07:00
Lior Avramov
c05d017091
[Mellanox] Remove iproute2 SDK patches from SONiC tree and consume them from SDK github (#15062)
- Why I did it
SDK patches for iproute2 were added to SONiC tree as a temporary solution.
Now that SDK with the patches is available, I have removed the patches from SONiC tree and we consume them from SDK github during compilation.

- How I did it
During build we download SDK iproute2 patches from SDK github (or from the URL provided by user if compiling SDK from sources) and apply them before compilation.

- How to verify it
Compile and load on switch, verify interfaces network devices created successfully.
Verify LLDP shows connections to neighbors.
Verify ping between 2 hosts over 2 router ports is successful.
2023-06-13 15:17:52 +03:00
Stephen Sun
238e6ffcc1
[Mellanox] Adjust warning threshold implementation according to the latest algorithm update (#15092)
- Why I did it
Adjust the warning threshold implementation according to the latest algorithm update

- How I did it
Modify power warning and critical thresholds methods

- How to verify it
Unit test updated to cover the change

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2023-06-13 15:14:10 +03:00
Kebo Liu
3cb13226be
Update SN5600 platform.json with service port sfp (#15337)
Signed-off-by: Kebo Liu <kebol@nvidia.com>
2023-06-13 14:15:15 +03:00
mssonicbld
1343b1eba3 [submodule] Update submodule sonic-platform-daemons to the latest HEAD automatically 2023-06-13 18:32:53 +08:00
mssonicbld
d7e75f48bf [submodule] Update submodule sonic-host-services to the latest HEAD automatically 2023-06-13 16:32:51 +08:00
mssonicbld
2227365107 [submodule] Update submodule sonic-mgmt-common to the latest HEAD automatically 2023-06-13 16:32:46 +08:00
mssonicbld
9ddb9d6852 [submodule] Update submodule sonic-mgmt-framework to the latest HEAD automatically 2023-06-13 16:32:42 +08:00
mssonicbld
713a8a8a7e [submodule] Update submodule sonic-swss to the latest HEAD automatically 2023-06-13 16:32:34 +08:00
jingwenxie
54a1ad10f9
[yang] Change asn to start from 0 for bgp monitor (#15350)
#### Why I did it
The asn 0 in BGP_MONITOR is invalid by YANG definition. However, the asn 0 in BGP_MONITOR is found in many devices. 
It was introduced by minigraph where its value is set to 0.
To unblock Config Updater test, the short term fix is to accept the asn 0 in BGP_MONITOR. 
We can revert this after NGS team make all the ASN change in minigraph.
##### Work item tracking
- Microsoft ADO **(24186140)**:

#### How I did it
Change the range
#### How to verify it
Unit test.
2023-06-12 21:59:34 -07:00
Hua Liu
05f1a5a31e
Add watchdog mechanism to swss service and generate alert when swss have issue. (#15429)
Add watchdog mechanism to swss service and generate alert when swss have issue. 

**Work item tracking**
Microsoft ADO (number only): 16578912

**What I did**
Add orchagent watchdog to monitor and alert orchagent stuck issue.

**Why I did it**
Currently SONiC monit system only monit orchagent process exist or not. If orchagent process stuck and stop processing, current monit can't find and report it.

**How I verified it**
Pass all UT.

Manually test process_monitoring/test_critical_process_monitoring.py can pass.

Add new UT https://github.com/sonic-net/sonic-mgmt/pull/8306 to check watchdog works correctly.

Manually test, after pause orchagent with 'kill -STOP <pid>', check there are warning message exist in log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

**Details if related**
Heartbeat message PR: https://github.com/sonic-net/sonic-swss/pull/2737
UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8306
2023-06-12 17:53:54 -07:00
Alpesh Patel
633fff8c10
enable ethernet backplane port support in port config for packet mode T2 devices (#14533)
For T2 systems using packet mode, the backplane interfaces (Ethernet-BP#) and the fabric card ethernet interfaces are not visible as neighbor interfaces.
In packet mode, these interfaces needs qos and buffer config as well.
This fix addresses that issue and adds the backplane interfaces to the PORTS_ACTIVE list
2023-06-12 14:02:22 -07:00
mssonicbld
cb9d9e57a6
[ci/build]: Upgrade SONiC package versions (#15431)
Upgrade SONiC Versions
2023-06-12 22:27:29 +08:00
mssonicbld
c74629a83a [submodule] Update submodule sonic-utilities to the latest HEAD automatically 2023-06-12 16:32:51 +08:00
mssonicbld
6b9c100974 [submodule] Update submodule sonic-host-services to the latest HEAD automatically 2023-06-11 16:32:32 +08:00
mssonicbld
50238d8039 [submodule] Update submodule sonic-platform-common to the latest HEAD automatically 2023-06-11 16:32:27 +08:00
mssonicbld
a45595158b
[ci/build]: Upgrade SONiC package versions (#15345) 2023-06-10 20:38:13 +08:00
mssonicbld
df20467b29
[submodule] Update submodule sonic-swss-common to the latest HEAD automatically (#15425) 2023-06-10 17:03:02 +08:00
mssonicbld
7f3d68f4c2 [submodule] Update submodule sonic-gnmi to the latest HEAD automatically 2023-06-10 16:32:55 +08:00
mssonicbld
bad9099fba [submodule] Update submodule linkmgrd to the latest HEAD automatically 2023-06-10 16:32:50 +08:00
mssonicbld
5c18870688
[submodule] Update submodule sonic-sairedis to the latest HEAD automatically (#15402) 2023-06-10 16:30:05 +08:00
mssonicbld
a48a813d08
[submodule] Update submodule sonic-utilities to the latest HEAD automatically (#15370) 2023-06-10 16:17:01 +08:00
mssonicbld
dc4eb9e90d
[submodule] Update submodule sonic-ztp to the latest HEAD automatically (#15426) 2023-06-10 16:05:44 +08:00
mssonicbld
e662c480dc
[submodule] Update submodule sonic-swss to the latest HEAD automatically (#15403) 2023-06-10 15:57:18 +08:00
mssonicbld
516e7930b2
[submodule] Update submodule sonic-platform-daemons to the latest HEAD automatically (#15401) 2023-06-10 15:30:27 +08:00
Liping Xu
78c41a1e58
allow docker_inram to kernel cmd list (#15374)
Why I did it
After docker_inram is enabled, the docker folder's default max size is 1.5G.
It's not big enough for some tests which need to install additional docker images or install extra packages.

Work item tracking
Microsoft ADO 24199761:
How I did it
add docker_inram into cmdline_allowlist

How to verify it
sudo sh -c 'echo "docker_inram_size=3000M" >> kernel-cmdline-append'
sudo reboot and check the docker folder size
2023-06-10 14:19:44 +08:00
Sudharsan Dhamal Gopalarathnam
162856ad9a
[sflow]Delay starting sflow service until ports are created (#15333)
* [sflow]Delay starting sflow service until ports are created
* Removing sflow from sonic.target dependency since it will be managed by hostcfgd
2023-06-09 16:28:15 -07:00
Saikrishna Arcot
d466994e91
teamd: Add support for custom retry counts for LACP sessions (#13453)
Why I did it
This is to add support for specifying custom retry counts for LACP sessions. This is to make warmboot easier on low-storage and low-memory platforms, by allowing more than 90 seconds of downtime.

How I did it
How to verify it
Tested manually with these cases:

Verify that changing the retry count using teamdctl PortChannel101 state item set runner.retry_count 5 takes effect
Verify that the retry count change actually affects when the LAG goes down by forcefully killing teamd on one side (i.e. setting the retry count to 5 causes the LAG to go down after 150 seconds)
Verify that the retry count gets reset to 3 after the LAG goes down for whatever reason
Verify that the retry count gets reset to 3 after some period of time (30 seconds * retry count)
Test cases are in sonic-net/sonic-mgmt#7961 and sonic-net/sonic-mgmt#8152.


Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-06-09 10:03:25 -07:00
mssonicbld
2b5c0dd0c6
[submodule] Update submodule sonic-swss-common to the latest HEAD automatically (#15404) 2023-06-09 15:57:30 +08:00
Ye Jianquan
cec9d7b83a
Revert "Add watchdog mechanism to swss service and generate alert when swss have issue. (#14686)" (#15390)
This reverts commit 44427a2f6b.
Docker image not updated during PR validation and caused PR check failures.
Force merge this revert. After cache is updated after this PR is merged, issue should be fixed.
2023-06-09 09:10:35 +08:00
Arvindsrinivasan Lakshmi Narasimhan
0f194c5a03
set the default value for the port fec to RS on J2 based LC (#15346)
Why I did it
Work item tracking
Microsoft ADO (24182162):
How I did it
update the config.bcm to set the default fec RS 100G Linecard

How to verify it
Tests on chassis
2023-06-08 11:08:48 -07:00
Vivek
9d8ab1b8e4
[Mellanox] Added patchwork link to commit message (#15301)
- Why I did it
Add the patchwork link to the commit description for non-upstream patches if present

- How I did it
Parse the patchwork/<patch_name>.txt file from hw-mgmt
2023-06-08 18:51:58 +03:00
Liu Shilong
96cac8e918
[ci] Add marvell-arm64 build in PR checks. (#15356)
Why I did it
Add marvell-arm64 platform build in PR checks to avoid build break.

Work item tracking
Microsoft ADO (number only): 17257160
How I did it
How to verify it
2023-06-08 09:40:20 +08:00
Ikki Zhu
9fcbd5ed1d
fix possible cpld race access issue (#15371)
Why I did it
fix possible cpld race read issue between watchdog and reboot cause
process

How I did it
Use fcntl.flock to limit parallel access to cpld sys file

How to verify it
It can be simulated and verified with following python script

``` python3
import fcntl
import signal
import threading

exit_flag = False

def get_cpld_reg_value(getreg_path, register):
    file = open(getreg_path, 'w+')
    # Acquire an exclusive lock on the file
    fcntl.flock(file, fcntl.LOCK_EX)

    try:
        file.write(register + '\n')
        file.flush()

        # Seek to the beginning of the file
        file.seek(0)

        # Read the content of the file
        result = file.readline().strip()
    finally:
        # Release the lock and close the file
        fcntl.flock(file, fcntl.LOCK_UN)
        file.close()

    return result

def cpld_read(thread_num, cpld_reg, expect_val):
    while not exit_flag:
        val
= get_cpld_reg_value("/sys/devices/platform/dx010_cpld/getreg",
cpld_reg)
        #print(f"Thread {thread_num}: get cpld reg {cpld_reg}, value
{val}")
        if val != expect_val:
            print(f"Thread {thread_num}: get cpld reg {cpld_reg}, value
{val}, expect_val {expect_val}")

def signal_handler(sig, frame):
    global exit_flag
    print("Ctrl+C detected. Quitting...")
    exit_flag = True

if __name__ == '__main__':
    # Register the signal handler for Ctrl+C
    signal.signal(signal.SIGINT, signal_handler)

    t1 = threading.Thread(target=cpld_read, args=(1, '0x103', '0x11',))
    t2 = threading.Thread(target=cpld_read, args=(2, '0x141', '0x00',))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
```
2023-06-07 11:29:18 -07:00
Yevhen Fastiuk
8a6d45227e
[Clock] Add timezone config YANG model (#14651)
* Add the ability to configure timezone

Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>

* Add YANG model for timezone

Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>

* Add timezone reference

Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>

---------

Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>
2023-06-07 10:39:24 -07:00
abdosi
6139c525d2
updated internal route policy for chassis-packet (#15349)
What I did:

Workaround for the issue seen here : FRRouting/frr#13682
It seems there is timing issue where there are multiple recursive lookup needed to resolve nexthop of the route it's possible that it does not happen correctly causing route to remain in inactive state

Issue is seen on chassis-packet as there 2 level of recursive lookup needed for a given e-BGP learnt route
- Level1 to resolve e-BGP peer (connected route via bgp ) over Loopback4096 (i-BGP peering)
- Level 2 Loopback4096 over backend port-channels next-hops

For VOQ chassis there is no e-BGP peer (connected route via bgp )  resolution as route is added as Static route by orchagent over Ethernet-IB.

Also as part of this remove route-map policy from instance.conf.j2 as same is define in peer-group.j2.

Microsoft ADO: https://msazure.visualstudio.com/One/_workitems/edit/24198507

How I verify:
Functional Verification manually
Updated UT.
We will be adding sanity check in sonic-mgmt to make sure none of route are in inactive state.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2023-06-07 09:17:44 -07:00
Rajkumar-Marvell
94790bef04
[sflow] Add egress sflow support. (#14630)
* [sflow] Add egress sflow support.
- Updated sonic-yang-model
- change hsflowd version to 2.0.45
2023-06-06 11:23:39 -07:00
mssonicbld
084d012749 [submodule] Update submodule sonic-mgmt-common to the latest HEAD automatically 2023-06-06 16:32:12 +08:00
mssonicbld
40eb97c2f3
[submodule] Update submodule sonic-utilities to the latest HEAD automatically (#15294) 2023-06-06 14:44:24 +08:00
mssonicbld
f78261cbac
[submodule] Update submodule sonic-swss to the latest HEAD automatically (#15355) 2023-06-06 14:34:37 +08:00
mssonicbld
ac56598db1 [submodule] Update submodule dhcprelay to the latest HEAD automatically 2023-06-06 14:33:15 +08:00
mssonicbld
ba241bbe3f [submodule] Update submodule sonic-platform-daemons to the latest HEAD automatically 2023-06-06 14:33:06 +08:00
Hua Liu
44427a2f6b
Add watchdog mechanism to swss service and generate alert when swss have issue. (#14686)
This PR depends on https://github.com/sonic-net/sonic-swss/pull/2737 merge first.

**What I did**
Add orchagent watchdog to monitor and alert orchagent stuck issue.

**Why I did it**
Currently SONiC monit system only monit orchagent process exist or not. If orchagent process stuck and stop processing, current monit can't find and report it.

**How I verified it**
Pass all UT.
Add new UT https://github.com/sonic-net/sonic-mgmt/pull/8306 to check watchdog works correctly.
Manually test, after pause orchagent with 'kill -STOP <pid>', check there are warning message exist in log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

**Details if related**
Heartbeat message PR: https://github.com/sonic-net/sonic-swss/pull/2737
UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8306
2023-06-05 22:21:17 -07:00
siqbal1986
381cfe4485
Added VNET_MONITOR_TABLE,BFD_SESSION_TABLE,VNET_ROUTE_TUNNEL_TABLE to the list (#14992)
* The 3 tables in state DB need to be cleaned up after SWSS restart for have consistant state.
2023-06-05 13:18:50 -07:00
Kalimuthu-Velappan
2627dcc5b4
07.Version Cache - Support for PIP (#14613)
During build, lots of pip packages are getting installed through pip
install command.

This feature adds support for caching all the pip packages into local
cache path, so that subsequent build always loads from the cache.
2023-06-05 12:02:33 -07:00
Marty Y. Lok
d4a81ea121
[Nokia-IXR7250E][Devicedata] update the device data for Nokia IXR7250E platform (#15216)
Why I did it
Update the device data files to support 1024 LAGs for Nokia IXR7250E platform
fixes https://github.com/Nokia-ION/ndk/issues/15

How I did it
Update the lag_id_end=1024 in chassisdb.conf file and add the trunk_group_max_members=16 in the BCM config file

How to verify it
check to allow to create lag ids up to 1024 with 16 port members

Signed-off-by: mlok <marty.lok@nokia.com>
2023-06-05 12:02:05 -07:00
Kalimuthu-Velappan
9dce453552
06.Version Cache - Support for wget (#14612)
When a package is referenced from the web through wget command,
it downloads the package for every build.

This feature caches all the packages that are being downloaded from the
web, so that subsequent build always loads the cache instead of from
web.
2023-06-05 12:00:58 -07:00
Aravind Mani
b26445cf7b
Dell FPGA driver fix (#15144)
Why I did it
FPGA driver crash was observed in Dell FPGA based platforms.

How I did it
Fixed FPGA crash

How to verify it
Load FPGA driver and check whether the kernel crashes.
2023-06-05 11:01:46 -07:00
mssonicbld
4335690de7 [ci/build]: Upgrade SONiC package versions 2023-06-05 20:51:47 +08:00
Liu Shilong
d2915969d4
[ci] Add OVERRIDE_BUILD_OPTIONS in image build template. (#15309)
Why I did it
Set build options in pipeline UI.
Support setting reproducible build options to py2,py3 in release branch and none in master branch.

Work item tracking
Microsoft ADO (number only): 22335854
How I did it
How to verify it
2023-06-05 18:42:06 +08:00