[lwip-users] LWIP ERR_ABRT after period of successful operation

Hello.

I am using LWIP 1.4.1 and FreeRTOS 7.3.0 on an Atmel SAME70Q21 processor with a Micrel KSZ8863 Ethernet switch chip in a custom networked product. We are at the point in development where we are deploying prototype units in various installations around the country, and we are seeing different network behavior in the different locations.

In our development lab, which has a very simple network with very few users, our product can run for days or even weeks without network issues. In other locations, mainly hospitals with hundreds of users and the IT departments, software, security policies, and firewalls that come with them, the same firmware runs for a day or two at most. In some locations the network code hangs as often as every 2-3 hours.

I have taken to capturing system performance statistics in an attempt to better describe symptoms and narrow in on solutions.

In our product, I generally do an HTTP GET to a user portal once per minute. I notice that tcpip_thread() runs 6 times for every GET/response. This seems to be consistent across all the deployment sites.

In our processor, I am able to count the number of Frame Receive Complete interrupts (GMAC_ISR_RCOMP) that are executed. Each site seems to have an ambient level of incoming frames it is processing, and this varies quite a bit from site to site. I can see that some of this traffic is ARP and none of it is ping traffic unless we ping the equipment ourselves. The only UDP/TCP traffic we see is DHCP (UDP:67) and DNS (UDP:53) at boot up, and then HTTP (TCP:80) when I do the HTTP GET or HTTP POST. Each site has a varying number of ARP requests, between 1 and 30 per second, and a varying amount of other non-UDP/non-TCP IP traffic which I have not yet identified.
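
For illustration, an ambient frame rate like the one described above could be derived from the interrupt count roughly as follows. This is only a sketch: frame_rate_t, frame_rate_on_rcomp() and frame_rate_sample() are hypothetical names, not taken from the actual firmware, and the millisecond timestamp is assumed to come from something like the RTOS tick count.

```c
#include <stdint.h>

/* Hypothetical bookkeeping for Frame Receive Complete interrupts. */
typedef struct {
    volatile uint32_t rcomp_count; /* incremented from the receive ISR */
    uint32_t last_count;           /* counter value at the previous sample */
    uint32_t last_ms;              /* timestamp of the previous sample */
} frame_rate_t;

/* Call once per GMAC_ISR_RCOMP interrupt (assumed single writer). */
static void frame_rate_on_rcomp(frame_rate_t *fr)
{
    fr->rcomp_count++;
}

/* Call periodically from a task; returns frames per second since the
 * previous call. Unsigned subtraction keeps the deltas correct across
 * counter and timestamp wrap-around. */
static uint32_t frame_rate_sample(frame_rate_t *fr, uint32_t now_ms)
{
    uint32_t count = fr->rcomp_count;          /* single aligned 32-bit read */
    uint32_t frames = count - fr->last_count;
    uint32_t elapsed_ms = now_ms - fr->last_ms;

    fr->last_count = count;
    fr->last_ms = now_ms;

    if (elapsed_ms == 0) {
        return 0; /* no time has passed, so no rate can be computed */
    }
    return (uint32_t)(((uint64_t)frames * 1000u) / elapsed_ms);
}
```

Logging this value alongside each hang would make it possible to compare the background load at the different sites at the moment of failure.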

When the product hangs, the networking code hangs in different ways. The hang mechanism I am currently focused on is ERR_ABRT/ECONNABORTED/Software caused connection abort. Once this occurs, and it can take hours or days to happen depending on the site, network communication does not work again until the appliance is power-cycled.
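
One way to pin down when the abort first appears is to record the outcome of every once-per-minute GET. The sketch below is an assumption about how that could be done, not the code running in the product: get_stats_t and get_stats_record() are hypothetical names, and the caller is assumed to pass the return value of the failing socket call together with the errno value it left behind.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-device statistics for the periodic HTTP GET. */
typedef struct {
    uint32_t attempts;
    uint32_t successes;
    uint32_t aborts;               /* failures with errno == ECONNABORTED */
    uint32_t other_failures;
    uint32_t consecutive_failures; /* reset to 0 by any success */
    uint32_t first_abort_attempt;  /* attempt number of first abort, 0 = none */
} get_stats_t;

/* result: return value of the socket call (negative means failure).
 * err:    errno value captured right after the failing call. */
static void get_stats_record(get_stats_t *s, int result, int err)
{
    s->attempts++;

    if (result >= 0) {
        s->successes++;
        s->consecutive_failures = 0;
        return;
    }

    s->consecutive_failures++;
    if (err == ECONNABORTED) {
        if (s->aborts == 0) {
            s->first_abort_attempt = s->attempts;
        }
        s->aborts++;
    } else {
        s->other_failures++;
    }
}
```

A consecutive_failures value that keeps growing after the first abort would confirm that the stack never recovers on its own, as opposed to failing intermittently.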

I have seen the post in the issue archive from 2012 that mentions MEMP_NUM_TCP_PCB being set to 5 and that number being too low for that application. I believe I only have one connection open at a time: I open the connection, issue the GET, and then close the socket, so I would expect the default setting of 5 to be 4 more than I need. Even so, I have bumped MEMP_NUM_TCP_PCB up from 5 to 10 and MEMP_NUM_UDP_PCB up from 4 to 10. I don’t know whether this will help. We have had our latest firmware running at multiple sites for a few days, and it is too early to tell whether increasing the number of PCBs has fixed anything.
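
In lwipopts.h, the change described above would look roughly like the fragment below. The two MEMP_NUM_* values are the ones stated above. The LWIP_STATS and MEMP_STATS lines are an added assumption, not something the current build is known to enable; they are shown because lwIP's per-pool statistics are one way to see whether a PCB pool is ever actually exhausted.

```c
/* lwipopts.h fragment (sketch) */

/* Simultaneously active TCP connections (raised from 5). */
#define MEMP_NUM_TCP_PCB   10

/* Simultaneously active UDP "connections" (raised from 4). */
#define MEMP_NUM_UDP_PCB   10

/* Assumption: enable lwIP statistics so that pool usage and allocation
 * failures are recorded per memory pool. */
#define LWIP_STATS         1
#define MEMP_STATS         1
```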

Thank you for any insight into this hang, or any ideas on how to narrow the problem down further.

Best regards,

Julia

Julia Irvin, MSEE
Oxford Ridge Solutions Group, Inc.

Melbourne, FL

USA
address@hidden
(321) 543-7140

From:	Julia Irvin
Subject:	[lwip-users] LWIP ERR_ABRT after period of successful operation
Date:	Fri, 29 Sep 2017 14:00:26 -0400