From: Julia Irvin
Subject: [lwip-users] LWIP ERR_ABRT after period of successful operation
Date: Fri, 29 Sep 2017 14:00:26 -0400
Hello.
I am using LWIP 1.4.1 and FreeRTOS 7.3.0 on an Atmel SAME70Q21 processor
and a Micrel KSZ8863 Ethernet switch chip in a custom networked product. We are
at the point in development where we are deploying prototype units in various
installations around the country, and we are seeing varying network behavior in
the different locations.
In our development lab, which has a very simple network with very few
users, our product can run for days or even weeks without network issues. In
some locations (mainly hospitals with hundreds of users, along with the IT
departments, software and security policies, firewalls, etc. that come with
them), the same firmware will run for a day or two at most. In some locations
the network code hangs as often as every 2-3 hours.
I have taken to capturing system performance statistics in an attempt to
better describe symptoms and narrow in on solutions.
In our product, I generally do an HTTP GET to a user portal once per
minute. I notice that tcpip_thread() runs 6 times for every GET/response. This
seems to be consistent across all the deployment sites.
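The once-per-minute request cycle (open a connection, issue the GET, close the
socket) might look roughly like the sketch below. It is written against the
POSIX sockets API so that it compiles on a host PC; it is not the firmware's
actual code. The name http_get_once and its parameters are made up for
illustration, and an lwIP 1.4.1 build would include lwip/sockets.h instead and
may need a different address-parsing call than inet_pton(). The point of the
sketch is that the socket is closed on every exit path, including errors.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: performs one HTTP/1.0 GET and returns the number of
 * response bytes stored in buf (NUL-terminated), or -1 on any error. */
int http_get_once(const char *ip, unsigned short port, const char *host,
                  const char *path, char *buf, size_t buflen)
{
    struct sockaddr_in addr;
    char req[256];
    int fd, reqlen;
    size_t total = 0;
    ssize_t n;

    if (buf == NULL || buflen == 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;                      /* not a dotted-quad IPv4 address */

    reqlen = snprintf(req, sizeof(req),
                      "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n", path, host);
    if (reqlen < 0 || (size_t)reqlen >= sizeof(req))
        return -1;                      /* request would not fit */

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto fail;
    if (send(fd, req, (size_t)reqlen, 0) != (ssize_t)reqlen)
        goto fail;

    /* Read until the server closes the connection or the buffer is full. */
    while (total < buflen - 1) {
        n = recv(fd, buf + total, buflen - 1 - total, 0);
        if (n < 0)
            goto fail;
        if (n == 0)
            break;
        total += (size_t)n;
    }
    buf[total] = '\0';
    close(fd);
    return (int)total;

fail:
    close(fd);                          /* release the socket on error too */
    return -1;
}
```

If a socket is left open after a failed connect() or send(), each failed cycle
holds on to a connection slot, so checking every error path is worth doing
before looking further.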
In our processor, I am able to count the number of Frame Receive Complete
interrupts (GMAC_ISR_RCOMP) that are executed. Each site seems to have an
ambient level of incoming frames it is processing, and this varies quite a bit
from site to site. I can see that some of this traffic is ARP, none of it is
pings unless we ping the equipment ourselves, and the only UDP/TCP traffic we
see is DHCP (UDP:67) and DNS (UDP:53) at boot up and then HTTP (TCP:80) when I
do the HTTP GET or HTTP POST. Each site has a varying number of ARP requests,
between 1 and 30 per second, and a varying amount of other non-UDP/non-TCP IP
traffic which I have not yet identified.
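One way to break the ambient traffic down further would be to tally received
frames by EtherType and IP protocol number. The sketch below is only an
illustration: struct frame_counts and classify_frame() are hypothetical names,
and it assumes untagged Ethernet II frames (no VLAN tag) passed in as a raw
byte buffer starting at the destination MAC address.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-site tally of received frames. */
struct frame_counts {
    uint32_t arp;            /* EtherType 0x0806 */
    uint32_t icmp;           /* IPv4 protocol 1 */
    uint32_t tcp;            /* IPv4 protocol 6 */
    uint32_t udp;            /* IPv4 protocol 17 */
    uint32_t other_ip;       /* IPv4, any other protocol (e.g. 2 = IGMP) */
    uint32_t other;          /* non-ARP, non-IPv4, or too short to parse */
    uint32_t ip_proto[256];  /* histogram of every IPv4 protocol number seen */
};

/* Classify one untagged Ethernet II frame and update the counters. */
void classify_frame(const uint8_t *frame, size_t len, struct frame_counts *c)
{
    uint16_t ethertype;
    uint8_t proto;

    if (len < 14) {                       /* shorter than an Ethernet header */
        c->other++;
        return;
    }
    ethertype = (uint16_t)((frame[12] << 8) | frame[13]);

    if (ethertype == 0x0806) {
        c->arp++;
        return;
    }
    if (ethertype != 0x0800 || len < 14 + 20) {
        c->other++;                       /* not IPv4, or truncated IP header */
        return;
    }

    proto = frame[14 + 9];                /* protocol field of the IPv4 header */
    c->ip_proto[proto]++;
    switch (proto) {
    case 1:  c->icmp++;     break;
    case 6:  c->tcp++;      break;
    case 17: c->udp++;      break;
    default: c->other_ip++; break;
    }
}
```

Dumping the ip_proto histogram periodically would show which protocol numbers
make up the unidentified non-UDP/non-TCP IP traffic at each site.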
When the product hangs, the networking code hangs in different ways. The
hang mechanism I am currently focused on is ERR_ABRT / ECONNABORTED ("Software
caused connection abort"). Once this occurs, which can take hours or days
depending on the site, network communication does not work again until the
appliance is power-cycled.
I have seen the post in the issue archive from 2012 that mentions
MEMP_NUM_TCP_PCB being set to 5 and that number being too low for that
application. I believe I only have one connection open at a time: I open the
connection, issue the GET, and then close the socket, so I would expect the
default setting of 5 to be four more than I need. I have nevertheless bumped
MEMP_NUM_TCP_PCB up from 5 to 10 and MEMP_NUM_UDP_PCB up from 4 to 10. I don’t
know whether this will help. Our latest firmware has been running at multiple
sites for a few days, and it is too early to tell whether increasing the number
of PCBs has fixed anything.
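For reference, the pool-size change described above would look roughly like
the lwipopts.h fragment below. The two MEMP_NUM_* values are the ones quoted in
this message; the statistics switches are an assumed addition, not something
the message says is enabled, included because lwIP's per-pool usage and error
counters are one way to check whether the PCB pools are actually running out.

```c
/* lwipopts.h fragment (sketch). Pool sizes are the values quoted in the
 * message; the statistics options are an assumed addition for diagnosis. */
#define MEMP_NUM_TCP_PCB    10  /* raised from 5 */
#define MEMP_NUM_UDP_PCB    10  /* raised from 4 */

#define LWIP_STATS          1   /* enable lwIP statistics collection */
#define MEMP_STATS          1   /* per-pool usage and error counters */
#define LWIP_STATS_DISPLAY  1   /* build the stats_display() print routines */
```

With these enabled, logging the memory-pool counters just before a hang would
indicate whether a pool was exhausted at that point.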
Thank you for any insight into this comm/hang issue and/or ideas on how to
narrow the problem further.
Best regards,
Julia
Julia Irvin, MSEE
Oxford Ridge Solutions Group, Inc.
Melbourne, FL USA
address@hidden (321) 543-7140