Bug 60872 - CPU pegged @ 100% w/ a lot of TCP sockets in CLOSE_WAIT state
Status: NEEDINFO
Alias: None
Product: Runtime
Classification: Mono
Component: General
Version: 5.4 (2017-06)
Hardware: PC Linux
Importance: --- normal
Target Milestone: ---
Assignee: Bugzilla
URL:
Duplicates: 60871
Depends on:
Blocks:
 
Reported: 2017-11-27 12:12 UTC by Mindaugas
Modified: 2017-12-07 12:39 UTC
CC List: 3 users

Tags:
Is this bug a regression?: ---
Last known good build:


Description Mindaugas 2017-11-27 12:12:49 UTC
Hi, 

Using the script below, a service hosted by the Mono runtime starts using all of the CPU, accumulates 200+ TCP connections stuck in CLOSE_WAIT that never close, and becomes unresponsive. The issue starts with a sudden increase in CPU usage and thread count (the thread count rises more slowly than the CPU usage, but the onset point is the same), together with an increase in CLOSE_WAIT sockets during that time. If a close() syscall is issued on the stuck descriptors with gdb, execution returns to normal.
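
(A sketch of that gdb workaround, with <pid> and <fd> as placeholders; a stuck descriptor can be picked out of ls -l /proc/<pid>/fd:)

# Attach to the hung process and force-close one leaked descriptor:
gdb -p <pid> -batch -ex 'call (int)close(<fd>)'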


Script:
#!/bin/bash
IP=$1
# Hammer the target with 100 concurrent keep-alive requests in the background
# for ~1 minute, then interrupt ab with SIGINT.
ab -n 10000000 -c 100 -k "http://$IP:55558/resourceIdEncoder/encodeCustomerId?CustomerId=256" &> /dev/null &
PID=$!
sleep 58.5
kill -2 "$PID"


Steps taken:
- Launch the service with Mono;
- Execute the script (above) against that service;
- Wait until the problem appears (~26-32 hours);

Actual:
- CLOSE_WAIT socket count is 200+ (a sudden increase is noticeable; see the counting one-liner below);
- CPU is pegged @ 100%
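
(One way to track the CLOSE_WAIT count, for anyone reproducing this; port 55558 is taken from the script above:)

# Count sockets sitting in CLOSE_WAIT on the service port, refreshed every 5 s:
watch -n 5 "ss -tn state close-wait '( sport = :55558 )' | tail -n +2 | wc -l"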


Version: 
[root@xxx ~]# /root/mono-debug-install-dir/bin/mono-sgen -V
Mono JIT compiler version 5.4.1.6 (tarball Tue Nov 14 04:43:11 EST 2017)
Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors. www.mono-project.com
        TLS:           __thread
        SIGSEGV:       altstack
        Notifications: epoll
        Architecture:  amd64
        Disabled:      none
        Misc:          softdebug
        LLVM:          supported, not enabled.
        GC:            sgen (concurrent by default)


Thread info:
https://pastebin.com/raw/EmMTu0B5


GDB: info threads:
https://pastebin.com/raw/zLwXEedc


GDB: thread apply all bt:
https://pastebin.com/raw/0cJUw0Ha


GDB: mono_stacktrace 25 (for the Threadpool worker):
https://pastebin.com/raw/hteNxvCv


Guidance on how to troubleshoot this further would be appreciated.

P.S. I cannot share the service code that the issue appears with. I'm currently writing a simpler service that I could share with the community and that wouldn't be proprietary. In the meantime, it would be great to hear some feedback on possible inspection / troubleshooting options.
Comment 1 Mindaugas 2017-11-27 12:14:20 UTC
*** Bug 60871 has been marked as a duplicate of this bug. ***
Comment 2 Ludovic Henry 2017-11-27 16:28:09 UTC
Hi,

The issue might be with either the threadpool or the socket async code. To troubleshoot both, could you please run the service with the `MONO_LOG_MASK=threadpool,io-threadpool`, `MONO_LOG_LEVEL=debug`, and `MONO_LOG_DEST=trace.log` environment variables defined, and share the resulting `trace.log` file here?
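
For example (with YourService.exe standing in for the actual service binary):

MONO_LOG_MASK=threadpool,io-threadpool MONO_LOG_LEVEL=debug MONO_LOG_DEST=trace.log \
    mono YourService.exe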

Thank you,
Comment 3 Mindaugas 2017-11-30 13:42:04 UTC
Currently trying the --aot options, a debug build of the hosted code, and also the options suggested above. This will take a couple of days.
Comment 4 Mindaugas 2017-12-04 09:59:48 UTC
output.mlpd is not really useful for debugging when the bug occurs rarely: mprof-report loads the .mlpd file fully into RAM, and once the file exceeds 64 GB it is unusable.

[root@x ~]# mprof-report --traces --maxframes=2 output.mlpd
Killed
[root@x ~]# mprof-report --traces --maxframes=2 --time=0-360 output.mlpd
Killed
[root@x ~]# mprof-report --traces --maxframes=2 --time=0-6 output.mlpd
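
(One possible way to keep the .mlpd manageable, assuming CPU samples alone are enough here: run the log profiler in sample-only mode, so allocation and call events are not recorded. YourService.exe is a placeholder.)

# Sample-only log profiling keeps the output file far smaller:
mono --profile=log:sample,output=output.mlpd YourService.exe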
Comment 5 Mindaugas 2017-12-07 12:39:23 UTC
Hi, 

Here are the last 6000 lines.
The log output stopped at this point, presumably around the time the issue appeared (I'm not sure, since no timestamps are available in the log and my external socket-monitoring tools were decommissioned due to a move of environments).

https://pastebin.com/raw/XuaFWjCe

Does this give any insight? Any further troubleshooting tips?

Thanks.