Bug 18098 - Mono runtime SIGSEGV while xsp2 is handling ThreadAbortException
Summary: Mono runtime SIGSEGV while xsp2 is handling ThreadAbortException
Status: NEEDINFO
Alias: None
Product: Runtime
Classification: Mono
Component: JIT
Version: 3.2.x
Hardware: PC Linux
Importance: --- normal
Target Milestone: ---
Assignee: Bugzilla
URL:
Depends on:
Blocks:
 
Reported: 2014-02-28 12:43 UTC by Tom
Modified: 2017-10-14 00:36 UTC
CC List: 6 users

See Also:
Tags:
Is this bug a regression?: ---
Last known good build:


Attachments
Trace log from the crashing server (24.96 KB, text/plain)
2014-03-04 11:09 UTC, Tom
Details
Better stack trace from customer core (37.11 KB, text/plain)
2014-03-10 16:13 UTC, Tom
Details
A trace from mono 3.2.8 (5.38 KB, application/octet-stream)
2014-03-24 11:47 UTC, Tom
Details

Description Tom 2014-02-28 12:43:47 UTC
I am running xsp2 2.10.2 with mono 3.2.6. The install is packaged as an RPM built from the 3.2.6 tarball (with the unofficial Makefile patch needed to make the 3.2.6 tarball buildable). I'm not sure yet what I need to do to get full gdb symbols installed on the system and loaded for the stack traces. It's a customer system, so this will be interesting.

About once per day on a fairly busy system, the mono runtime segfaults. The fault always happens inside GetApplicationForPath, called from XSPWorker::RunInternal, and always seems to happen when the ThreadAbortException is thrown.

Sometimes the crash happens after domain unloading; these cases appear without a managed stack dump:

Domain is unloading, not processing the request.
Stacktrace:


Native stacktrace:

	/usr/bin/mono [0x8100833]
	/usr/bin/mono [0x8151801]
	/usr/bin/mono [0x805b704]
	[0x419440]
	/usr/bin/mono [0x81134af]
	/usr/bin/mono [0x8151cde]
	/usr/bin/mono [0x80ff3fe]
	/usr/bin/mono [0x80ff77e]
	/usr/bin/mono [0x810147d]
	/usr/bin/mono [0x81518c6]
	[0xa3114c]

Debug info from gdb:

Listening on port: 8081 (non-secure)
System.Threading.ThreadAbortException: Thread was being aborted
  at Mono.WebServer.ApplicationServer.GetApplicationForPath (System.String vhost, Int32 port, System.String path, Boolean defaultToRoot) [0x00000] in <filename unknown>:0 
  at (wrapper remoting-invoke-with-check) Mono.WebServer.ApplicationServer:GetApplicationForPath (string,int,string,bool)
  at Mono.WebServer.XSPWorker.RunInternal (System.Object state) [0x00000] in <filename unknown>:0 
System.Threading.ThreadAbortException: Thread was being aborted
  at <0x00000> <unknown method>
  at <0x00000> <unknown method>
  at <0x00000> <unknown method>
  at <0x00000> <unknown method>
  at <0x00000> <unknown method>
  at <0x00000> <unknown method>
  at <0x00000> <unknown method>
  at <0x00000> <unknown method>
System.Threading.ThreadAbortException: Thread was being aborted
  at Mono.WebServer.ApplicationServer.GetApplicationForPath (System.String vhost, Int32 port, System.String path, Boolean defaultToRoot) [0x00000] in <filename unknown>:0 
  at (wrapper remoting-invoke-with-check) Mono.WebServer.ApplicationServer:GetApplicationForPath (string,int,string,bool)
  at Mono.WebServer.XSPWorker.RunInternal (System.Object state) [0x00000] in <filename unknown>:0 
System.Threading.ThreadAbortException: Thread was being aborted
  at Mono.WebServer.ApplicationServer.GetApplicationForPath (System.String vhost, Int32 port, System.String path, Boolean defaultToRoot) [0x00000] in <filename unknown>:0 
  at (wrapper remoting-invoke-with-check) Mono.WebServer.ApplicationServer:GetApplicationForPath (string,int,string,bool)
  at Mono.WebServer.XSPWorker.RunInternal (System.Object state) [0x00000] in <filename unknown>:0 
Stacktrace:


Native stacktrace:

	/usr/bin/mono [0x8100833]
	/usr/bin/mono [0x8151801]
	/usr/bin/mono [0x805b704]
	[0x3fb440]
	/usr/bin/mono [0x81134af]
	/usr/bin/mono [0x8151cde]
	/usr/bin/mono [0x80ff3fe]
	/usr/bin/mono [0x80ff77e]
	/usr/bin/mono [0x810147d]
	/usr/bin/mono [0x81518c6]
	[0xb28781]

Debug info from gdb:


When there *is* a stack trace, it always ends with GetApplicationForPath/RunInternal, and it is always a ThreadAbortException.

I will see about getting better stack dumps with debug symbols ASAP.
Comment 1 Tom 2014-02-28 16:31:47 UTC
On further examination of the trace log, it looks like this frequently happens around the message "Domain is unloading, not processing the request." In general, there's some sort of crash-and-burn bug around the domain unload code.

I'm pretty sure the mono executables are not stripped, so I don't know why there are no symbols in the stack dump.
Comment 2 Chris Hamons 2014-02-28 17:23:05 UTC
Tom, if you follow the instructions here:

http://www.mono-project.com/ASP.NET#Debugging

"By default xsp and xsp2 run in Release mode, which means that debugging line-number information will not be available in stack traces when errors occur.

To obtain line numbers in stack traces you need to do two things:

1. Enable Debug code generation in your page.
2. Run Mono with the --debug command line option.

You must enable debug code generation using Debug="true" at the top of your page, or by setting the compilation flag in Web.config (compilation option).

Use the --debug command line option to Mono, this is done by setting the MONO_OPTIONS environment variable, like this:

$ MONO_OPTIONS=--debug xsp2
Listening on port: 8080 (non-secure)
Listening on address: 0.0.0.0
Root directory: /tmp/us
Hit Return to stop the server.
To do the same with the Apache mod_mono module, use the MonoDebug true directive in your apache configuration file."

you should be able to get some symbols for the stack dump.
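
A minimal Web.config fragment for step 1 would look like this (sketched here for reference; standard ASP.NET schema):

	<configuration>
	  <system.web>
	    <compilation debug="true" />
	  </system.web>
	</configuration>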
Comment 3 Tom 2014-02-28 18:14:13 UTC
Thanks, I guess line numbers for the managed code could be useful. I'll turn that on, of course. I'll also ask the customer to install gdb. All symbolic information is present; mono-sgen is not stripped.

The runtime dumps core "randomly", and it appears to do this while managing the app domains. This matches the behavior we saw on the 2.10.x runtime, where the threads apparently hang, one by one, until the app stops working.

So I'm very interested in what's happening in the native code.

Unfortunately, this has proved impossible to reproduce outside of a high volume site.
Comment 4 Tom 2014-02-28 18:41:03 UTC
Note: I think almost everything in the log file is a red herring except for the native stacktrace.
Comment 5 Zoltan Varga 2014-02-28 20:15:24 UTC
It would help to get some kind of native backtrace.
Comment 6 Tom 2014-03-04 11:09:09 UTC
Created attachment 6211 [details]
Trace log from the crashing server
Comment 7 Tom 2014-03-04 11:11:22 UTC
I have attached a trace log containing gdb dumps. The threads are mostly in threading code, sleeps, and reads. There's really not much going on in the app besides handling a web request and passing it on to a remoting interface...
Comment 8 Tom 2014-03-04 13:14:20 UTC
Maybe we can try to get a core dump?
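
If it helps, here's the rough plan on my side (a sketch; core.12345 is a stand-in for whatever filename the kernel actually writes):

	$ ulimit -c unlimited          # allow core dumps in the shell that starts xsp2
	$ MONO_OPTIONS=--debug xsp2
	...
	$ gdb /usr/bin/mono core.12345
	(gdb) thread apply all bt full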
Comment 9 Zoltan Varga 2014-03-04 14:06:51 UTC
We would need some kind of test case to be able to debug this. Also, please try a recent mono version, like 3.2.7 or the upcoming 3.4.0; maybe the problem is already fixed there.
Comment 10 Tom 2014-03-04 15:40:46 UTC
I'm unable to reproduce the bug outside of a live site. I get regular, chronic crashing at two different customer sites and an internal server, but it only appears to happen on high-traffic servers with real remote systems making the connections. We're currently modifying our staging server to attempt to recreate these conditions, but I doubt it will reduce to a simple test case. :) I'll do my best.
 
I will try an unstable release candidate (or trunk), if I can reproduce the bug on a test system, or if I become desperate enough to deploy trunk builds at a customer site. 

The customer is going to send me some core dumps, for what it's worth.
Comment 11 Tom 2014-03-10 16:13:26 UTC
Created attachment 6274 [details]
Better stack trace from customer core

bt + bt full of all threads from a customer-supplied core dump. Thread 1 appears to be trying to free(0) in g_free. Hey, it's something. :)
Comment 12 Tom 2014-03-10 16:20:38 UTC
It's not very likely the pointer was actually 0, since g_free guards with "if (ptr != NULL)", but memory management is a suspect now. Still trying to reproduce the bug outside of live sites.
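
For reference, eglib's g_free is essentially just a guarded free (paraphrasing the Mono source from memory):

	#include <stdlib.h>

	void
	g_free (void *ptr)
	{
		/* a NULL pointer is ignored, so a literal free(0) should be harmless */
		if (ptr != NULL)
			free (ptr);
	}

So if we really died inside g_free, the pointer was probably non-NULL garbage rather than 0.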
Comment 13 Zoltan Varga 2014-03-10 16:22:43 UTC
The crash does seem to happen in thread 1; however, the stack trace seems corrupted. There are a lot of functions there which don't make much sense. The app seems to be doing an appdomain unload/GC at the same time.
Comment 14 Tom 2014-03-20 12:31:30 UTC
Yes, I see what you mean. Well, these are from core files; perhaps the core files aren't quite right. I hate working with core files!

We updated the client to mono 3.2.8 on Tuesday, and they've been getting about two crashes per day. I'm still waiting for them to upload some core files. I saved the build dir, so these should be as close as possible.
Comment 15 Tom 2014-03-24 11:47:38 UTC
Created attachment 6398 [details]
A trace from mono 3.2.8

I am more confident of this stack trace, as it is definitely built from the debug tree in which I dumped the stack. Thread 15 has an interesting stack. I have requested more cores to see if there is a recurring pattern. (They should be getting several per day, but it's like pulling teeth.)
Comment 16 Zoltan Varga 2014-03-24 17:35:03 UTC
The crash happens in thread 1:

#6  <signal handler called>
#7  0x082028a9 in mono_g_hash_table_lookup_extended (hash=0xb0c64008, 
    key=0xb7886e98, orig_key=0xb34fd784, value=0xb34fd780) at mono-hash.c:276
#8  0x08202966 in mono_g_hash_table_lookup (hash=0xb0c64008, key=0xb7886e98)
    at mono-hash.c:260
#9  0x0821f968 in mono_image_get_fieldref_token (assembly=0xb0c4f890, 
    f=0xb7886e98, field=0xa28ac7c) at reflection.c:2748
#10 0x08220ed9 in mono_image_create_token (assembly=0xb0c4f890, 
    obj=0xb7886e98, create_open_instance=1, register_token=1)
    at reflection.c:5020
#11 0x0818560f in ves_icall_ModuleBuilder_getToken (mb=0xb0639870, 
    obj=0xb7886e98, create_open_instance=1) at icall.c:1159

Never seen this one before.
Comment 17 Tom 2014-03-24 18:03:53 UTC
I can't thank you enough for looking at this. But clearly beer, at the very least, is in order.

Walking up the stack, we are apparently on the "hashcode =" line:


	equal = hash->key_equal_func;

	hashcode = ((*hash->hash_func) (key)) % hash->table_size;
	
	for (s = hash->table [hashcode]; s != NULL; s = s->next){
		if ((*equal)(s->key, key)){

s appears to be optimized out, but if gdb is telling the truth, we didn't get there. Most everything appears to be OK, although I'm totally clueless as to what correct/typical value ranges are for virtual heap/function addresses.

#7  0x082028a9 in mono_g_hash_table_lookup_extended (hash=0xb0c64008, key=0xb7886e98, orig_key=0xb34fd784, 
    value=0xb34fd780) at mono-hash.c:276
(gdb) p hash
$1 = (MonoGHashTable *) 0xb0c64008
(gdb) p key
$2 = (const void *) 0xb7886e98
(gdb) p hash->table_size
$3 = 21
(gdb) p hash->hash_func
$4 = (GHashFunc) 0x9fb799c
(gdb) info locals
equal = 0xaf25ce04
s = <value optimized out>
(gdb) p hashcode
$5 = <value optimized out>
(gdb) p *hash
$6 = {hash_func = 0x9fb799c, key_equal_func = 0xaf25ce04, table = 0xaeffef38, table_size = 21, in_use = 167475612, 
  threshold = -1356476924, last_rehash = -1358959240, value_destroy_func = 0x15, key_destroy_func = 0xa09c8b4, 
  gc_type = 2938490464}

There is exactly one thing I'm doing that could cause threading issues. I'm using .NET Remoting (it's based on ancient code!). I am careful not to use the remoting proxy object across threads, but browsing through the code today, I see that the same object is returned to any caller from a static hash, regardless of the thread. So... the question of whether a Remoting TransparentProxy object is truly thread-safe is an issue here. (The SingleCall object on the server side is entirely thread-safe.)
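
To illustrate the pattern (a sketch with hypothetical names; IRemoteService and RemoteProxyCache are stand-ins, not our actual code):

	using System;
	using System.Collections;

	interface IRemoteService { string DoWork (string arg); }   // hypothetical contract

	static class RemoteProxyCache
	{
		// One proxy per URL, created once and handed out to every worker thread.
		static readonly Hashtable cache = new Hashtable ();

		public static IRemoteService Get (string url)
		{
			lock (cache) {
				IRemoteService proxy = (IRemoteService) cache [url];
				if (proxy == null) {
					proxy = (IRemoteService) Activator.GetObject (
						typeof (IRemoteService), url);
					cache [url] = proxy;
				}
				// Every caller gets the same TransparentProxy instance,
				// regardless of which thread it runs on.
				return proxy;
			}
		}
	}

The lookup itself is locked, but the returned proxy is then used concurrently from many threads.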

I am trying to deploy a test version of the software that (in a fairly brute-force way) removes all multithreaded access to the remoting objects. Assuming the customer eventually gets around to installing it, I will report the results of that test as well.
Comment 18 Tom 2014-03-27 11:39:01 UTC
According to the customer, this is the only core dump we have had since switching to the 3.2.8 runtime. xsp2 still becomes unresponsive about twice per day and requires restarting, but usually does not SEGV. An improvement, I guess. :)
Comment 19 Tom 2014-04-01 12:03:38 UTC
Got some more core files. Core file results are interesting but inconsistent. Always in thread 1, with thread 2 somewhere in GC tasks. Other threads seem unremarkable: lots of skip-thread/suspended-thread.

Am I looking at a threading problem here? I have now applied the "brute-force single-threading" patch to the customer system. I am hopeful that this is a problem with the remoting proxies and not something in the actual runtime.

core "13601": 

Looks like an assertion failure, again in thread #1. mono_thread_execute_interruption() returned NULL?

Thread #2 is deep inside sgen_restart_world() (waiting for mono_thread_pthreads_kill).

#0  0x00af2402 in __kernel_vsyscall ()
#1  0x001b7df0 in _quicksort () from /lib/libc.so.6
#2  0x001b9701 in random () from /lib/libc.so.6
#3  0x08101f99 in mono_handle_native_sigsegv (signal=6, ctx=0xb34feacc)
    at mini-exceptions.c:2335
#4  <signal handler called>
#5  0x00af2402 in __kernel_vsyscall ()
#6  0x001b7df0 in _quicksort () from /lib/libc.so.6
#7  0x001b9701 in random () from /lib/libc.so.6
#8  0x08292162 in monoeg_g_logv (log_domain=0x2e5ff4 "\234].", 
    log_level=G_LOG_LEVEL_ERROR, 
    format=0x829d450 "* Assertion: should not be reached at %s:%d\n", 
    args=0xb34fef74 "-\324\066\b\351\021") at goutput.c:175
#9  0x082921a7 in monoeg_assertion_message (
    format=0x829d450 "* Assertion: should not be reached at %s:%d\n")
    at goutput.c:195
#10 0x081dced9 in self_interrupt_thread (_unused=0x0) at threads.c:4585
#11 0x00b16289 in ?? ()
#12 0x00000000 in ?? ()

core 25200:

SEGV while doing something with threads? Looks like the stack is damaged too. The other threads seem pretty unremarkable, and thread #2 appears to be suspended while doing garbage collection (sgen_wait_for_suspend_ack()).

#0  0x00f86402 in __kernel_vsyscall ()
#1  0x001b7df0 in _quicksort () from /lib/libc.so.6
#2  0x001b9701 in random () from /lib/libc.so.6
#3  0x08101f99 in mono_handle_native_sigsegv (signal=11, ctx=0xb7cb3d0c)
    at mini-exceptions.c:2335
#4  0x08156ae1 in mono_arch_handle_altstack_exception (sigctx=0xb7cb3d0c, 
    fault_addr=0x18, stack_ovf=0) at exceptions-x86.c:1163
#5  0x0805b4c4 in mono_sigsegv_signal_handler (_dummy=11, info=0xb7cb3c8c, 
    context=0xb7cb3d0c) at mini.c:6769
#6  <signal handler called>
#7  async_invoke_thread (data=0xa99b6c20) at threadpool.c:602
#8  0x081db78f in start_wrapper_internal (data=0x9a1f9c0) at threads.c:643
#9  start_wrapper (data=0x9a1f9c0) at threads.c:688
#10 0x0827aebe in thread_start_routine (args=0x981a80c) at wthreads.c:294
#11 0x0828d88e in inner_start_thread (arg=0x9a1d300)
    at mono-threads-posix.c:49
#12 0x00323852 in start_thread () from /lib/libpthread.so.0
#13 0x00262a8e in bdflush () from /lib/libc.so.6
#14 0xb6352b90 in ?? ()

core 3929: 

Looks exactly like 25200, including the segv in thread #1 (threadpool) and thread #2 doing GC.

#0  0x005ff402 in __kernel_vsyscall ()
#1  0x001b7df0 in _quicksort () from /lib/libc.so.6
#2  0x001b9701 in random () from /lib/libc.so.6
#3  0x08101f99 in mono_handle_native_sigsegv (signal=11, ctx=0xb60b4d0c)
    at mini-exceptions.c:2335
#4  0x08156ae1 in mono_arch_handle_altstack_exception (sigctx=0xb60b4d0c, 
    fault_addr=0x18, stack_ovf=0) at exceptions-x86.c:1163
#5  0x0805b4c4 in mono_sigsegv_signal_handler (_dummy=11, info=0xb60b4c8c, 
    context=0xb60b4d0c) at mini.c:6769
#6  <signal handler called>
#7  async_invoke_thread (data=0xaa0f22c0) at threadpool.c:602
#8  0x081db78f in start_wrapper_internal (data=0x9ce12d0) at threads.c:643
#9  start_wrapper (data=0x9ce12d0) at threads.c:688
#10 0x0827aebe in thread_start_routine (args=0x9af280c) at wthreads.c:294
#11 0x0828d88e in inner_start_thread (arg=0x9cdefb8)
    at mono-threads-posix.c:49
#12 0x00323852 in start_thread () from /lib/libpthread.so.0
#13 0x00262a8e in bdflush () from /lib/libc.so.6
#14 0xb60d5b90 in ?? ()
Comment 20 Zoltan Varga 2014-04-01 14:05:00 UTC
The runtime code is very complicated, so it's very hard to track down these kinds of problems without a test case, or at least a reproducible crash/stack trace.
Comment 21 Tom 2014-04-04 10:36:13 UTC
The system has been running for 18 hours without a crash, by disabling viewstate in the heavily used web page. (Fortunately, it is a mini application that just handles the load for one specific aspect of the app.)

I believe we may be seeing a variation on the bug described here: http://forcedtoadmin.blogspot.com/2013/12/unexpected-unloading-of-mono-web.html related to appdomain reloading, or creation of multiple viewstates. However, I have had no luck in reproducing this specific problem as he describes it.
Comment 22 Tom 2014-04-04 10:38:11 UTC
Did I say viewstate? SESSION state. :) We added "<sessionState mode="Off" />" to web.config and the problem is greatly reduced. Maybe eliminated. Augh, I need coffee.
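
For anyone following along, that sits in the usual place in web.config (minimal sketch):

	<configuration>
	  <system.web>
	    <sessionState mode="Off" />
	  </system.web>
	</configuration>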
Comment 23 Tom 2014-04-22 13:11:36 UTC
Customer reports that the problem no longer occurs with sessionState disabled. The problems appear to be a side effect of session state cache management (in an app that never accesses or requests session state, hmm). We have a workaround at this point.

I don't see a way to truly fix the problem, short of redesigning the way a lot of the internals work. So I guess I'm done here, thanks.
Comment 24 Rodrigo Kumpera 2017-10-14 00:36:03 UTC
Thank you for your report!

It appears you are running a very old version of Mono. Could you please update to a recent version and try to reproduce the issue?

If the issue still persists, please include the version information and change the bug status to NEW.
