Bug 28369

Summary: mono runtime crash "assertion 'hash != NULL' failed"
Product: [Mono] Runtime Reporter: matthi.d
Component: GeneralAssignee: Marek Safar <masafa>
Status: RESOLVED FIXED    
Severity: normal CC: kumpera, mono-bugs+mono, mono-bugs+runtime, vargaz
Priority: ---    
Version: 3.12.0   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Tags: Is this bug a regression?: ---
Last known good build:

Description matthi.d 2015-03-24 13:03:15 UTC
Description of Problem:
I think this are actually multiple bugs.

If you create a new AppDomain which watches for a DomainUnload event on an existing AppDomain you get inconsistent behavior which look like several race conditions in the mono runtime.

Steps to reproduce the problem:
1. git clone https://github.com/matthid/mono-runtime-bug.git
2. cd mono-runtime-bug
3. ./build.sh
4a. (Windows/git bash): i=0; while [ $i -lt 100 ] ; do ./mono-runtime-bug/bin/Debug/mono-runtime-bug.exe; let i=i+1; done
4b. (Linux): for i in `seq 1 100`; do mono mono-runtime-bug/bin/Debug/mono-runtime-bug.exe; done


Actual Results:
There are three possible results on mono

1.

Switching to second AppDomain...
Setting up cleanup domain
Waiting for exit.
Finish work of second AppDomain
unload second AppDomain
Notice AppDomain.DomainUnload...
Waiting for Shutdown
ThreadAbortException
AppDomain unloaded
Work finished.
Test 1
Test 2
Test 3
Test 4
Test 5
Test 6
Test 7
Test 8
Test 9
Test 10

2.

Switching to second AppDomain...
Setting up cleanup domain
Waiting for exit.
Finish work of second AppDomain
unload second AppDomain
Notice AppDomain.DomainUnload...
Waiting for Shutdown
AppDomain alive
AppDomainUnloadedException
AppDomain unloaded
Work finished.
Test 1
Test 2
Test 3
Test 4
Test 5
Test 6
Test 7
Test 8
Test 9
Test 10

3.

Switching to second AppDomain, for RazorEngine...
Setting up cleanup domain
Waiting for exit.
Finish work of second AppDomain
unload second AppDomain
Notice AppDomain.DomainUnload...
Waiting for Shutdown
/var/tmp/portage/dev-lang/mono-3.12.1/work/mono-3.12.1/mono/metadata/mono-hash.c:265: assertion 'hash != NULL' failed
/var/tmp/portage/dev-lang/mono-3.12.1/work/mono-3.12.1/mono/metadata/mono-hash.c:423: assertion 'hash != NULL' failed
Stacktrace:


Native stacktrace:

        mono(+0xd5ce0) [0x3756fa4dce0]
        mono(+0x13950f) [0x3756fab150f]
        mono(+0x433c5) [0x3756f9bb3c5]
        /lib64/libpthread.so.0(+0x11260) [0x3756ee42260]
        mono(+0x1806d6) [0x3756faf86d6]
        mono(+0x1808b0) [0x3756faf88b0]
        mono(+0x181317) [0x3756faf9317]
        mono(+0x256b57) [0x3756fbceb57]
        mono(+0x23007d) [0x3756fba807d]
        mono(mono_gc_collect+0x52) [0x3756fba9152]
        mono(+0x1eee8b) [0x3756fb66e8b]
        mono(+0x292ccf) [0x3756fc0accf]
        /lib64/libpthread.so.0(+0x91da) [0x3756ee3a1da]
        /lib64/libc.so.6(clone+0x6d) [0x3756eb7720d]

Debug info from gdb:

warning: File "/usr/bin/mono-sgen-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
        add-auto-load-safe-path /usr/bin/mono-sgen-gdb.py
line to your configuration file "/home/matthid/.gdbinit".
To completely disable this security protection add
        set auto-load safe-path /
line to your configuration file "/home/matthid/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
        info "(gdb)Auto-loading safe path"
[New LWP 21549]
[New LWP 21548]
[New LWP 21543]
[New LWP 21539]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x000003756eac2a93 in sigsuspend () from /lib64/libc.so.6
  Id   Target Id         Frame
  5    Thread 0x3756b51b700 (LWP 21539) "Finalizer" 0x000003756eac2a93 in sigsuspend () from /lib64/libc.so.6
  4    Thread 0x3756b1ff700 (LWP 21543) "Timer-Scheduler" 0x000003756eac2a93 in sigsuspend () from /lib64/libc.so.6
  3    Thread 0x3756affe700 (LWP 21548) "mono" 0x000003756eac2a93 in sigsuspend () from /lib64/libc.so.6
  2    Thread 0x3756adfd700 (LWP 21549) "Unload thread f" 0x000003756ee41e27 in waitpid () from /lib64/libpthread.so.0
* 1    Thread 0x3756f952780 (LWP 21538) "mono" 0x000003756eac2a93 in sigsuspend () from /lib64/libc.so.6

Thread 5 (Thread 0x3756b51b700 (LWP 21539)):
#0  0x000003756eac2a93 in sigsuspend () from /lib64/libc.so.6
#1  0x000003756fb9d846 in ?? ()
#2  <signal handler called>
#3  0x000003756ee4063e in sem_wait () from /lib64/libpthread.so.0
#4  0x000003756fc042df in mono_sem_wait ()
#5  0x000003756fb6ed54 in ?? ()
#6  0x000003756fb4d970 in ?? ()
#7  0x000003756fc0accf in ?? ()
#8  0x000003756ee3a1da in start_thread () from /lib64/libpthread.so.0
#9  0x000003756eb7720d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x3756b1ff700 (LWP 21543)):
#0  0x000003756eac2a93 in sigsuspend () from /lib64/libc.so.6
#1  0x000003756fb9d846 in ?? ()
#2  <signal handler called>
#3  0x000003756ee3e81e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#4  0x000003756fbdd6e3 in ?? ()
#5  0x000003756fbf39a3 in ?? ()
#6  0x000003756fb4d090 in ?? ()
#7  0x000003756fb4ebd6 in ?? ()
#8  0x00000000418c7c54 in ?? ()
#9  0x0000000000000001 in ?? ()
#10 0x0000000000000000 in ?? ()

Thread 3 (Thread 0x3756affe700 (LWP 21548)):
#0  0x000003756eac2a93 in sigsuspend () from /lib64/libc.so.6
#1  0x000003756fb9d846 in ?? ()
#2  <signal handler called>
#3  0x000003756eb68cdd in write () from /lib64/libc.so.6
#4  0x000003756eb03e73 in _IO_file_write () from /lib64/libc.so.6
#5  0x000003756eb03d52 in ?? () from /lib64/libc.so.6
#6  0x000003756eb050e0 in _IO_do_write () from /lib64/libc.so.6
#7  0x000003756eb04701 in _IO_file_xsputn () from /lib64/libc.so.6
#8  0x000003756ead4689 in vfprintf () from /lib64/libc.so.6
#9  0x000003756eb855dd in __fprintf_chk () from /lib64/libc.so.6
#10 0x000003756fc11adb in ?? ()
#11 0x000003756fc11e62 in ?? ()
#12 0x000003756fc11f43 in ?? ()
#13 0x000003756fb7a6a2 in ?? ()
#14 0x00000000415d1aaa in ?? ()
#15 0x0000000000000001 in ?? ()
#16 0x0000000000000002 in ?? ()
#17 0x0000000000000000 in ?? ()

Thread 2 (Thread 0x3756adfd700 (LWP 21549)):
#0  0x000003756ee41e27 in waitpid () from /lib64/libpthread.so.0
#1  0x000003756fa4dd6e in ?? ()
#2  0x000003756fab150f in ?? ()
#3  0x000003756f9bb3c5 in ?? ()
#4  <signal handler called>
#5  0x000003756faf86d6 in ?? ()
#6  0x000003756faf88b0 in ?? ()
#7  0x000003756faf9317 in ?? ()
#8  0x000003756fbceb57 in ?? ()
#9  0x000003756fba807d in ?? ()
#10 0x000003756fba9152 in mono_gc_collect ()
#11 0x000003756fb66e8b in ?? ()
#12 0x000003756fc0accf in ?? ()
#13 0x000003756ee3a1da in start_thread () from /lib64/libpthread.so.0
#14 0x000003756eb7720d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x3756f952780 (LWP 21538)):
#0  0x000003756eac2a93 in sigsuspend () from /lib64/libc.so.6
#1  0x000003756fb9d846 in ?? ()
#2  <signal handler called>
#3  0x000003756ee3e44a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#4  0x000003756fbdd773 in ?? ()
#5  0x000003756fbf3b7b in ?? ()
#6  0x000003756fb69f12 in mono_domain_try_unload ()
#7  0x000003756fb6a017 in mono_domain_unload ()
#8  0x00000000415d1859 in ?? ()
#9  0x0000037570061c70 in ?? ()
#10 0x000003ff35d60e40 in ?? ()
#11 0x000000004154c0d0 in ?? ()
#12 0x000003ff35d60e40 in ?? ()
#13 0x000003756c8b46b8 in ?? ()
#14 0x00000375700596b0 in ?? ()
#15 0x00000000415d140c in ?? ()
#16 0x000003ff35d60b40 in ?? ()
#17 0x000003ff35d60a20 in ?? ()
#18 0x00000000415d13d8 in ?? ()
#19 0x000000004154bd80 in ?? ()
#20 0x000000004154be6c in ?? ()
#21 0x0000037570061c70 in ?? ()
#22 0x000003ff35d60b40 in ?? ()
#23 0x000003ff35d60e40 in ?? ()
#24 0x000000004154c0d0 in ?? ()
#25 0x000000004154bd80 in ?? ()
#26 0x000003756c8b46b8 in ?? ()
#27 0x000003ff00000000 in ?? ()
#28 0x000003756fc0c475 in ?? ()
#29 0x000003756f9be1e0 in ?? ()
#30 0x000003756fb788b0 in mono_runtime_invoke ()
#31 0x000003756fb7ad60 in mono_runtime_exec_main ()
#32 0x000003756fa1e6de in mono_main ()
#33 0x000003756eaaed55 in __libc_start_main () from /lib64/libc.so.6
#34 0x000003756f9af345 in ?? ()

=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================

/var/tmp/portage/dev-lang/mono-3.12.1/work/mono-3.12.1/mono/metadata/mono-hash.c:423: assertion 'hash != NULL' failed
Abgebrochen


Expected Results (always):

Switching to second AppDomain, for RazorEngine...
Setting up cleanup domain
Waiting for exit.
Finish work of second AppDomain
unload second AppDomain
Notice AppDomain.DomainUnload...
Waiting for Shutdown
AppDomainUnloadedException
AppDomain unloaded
Work finished.
Test 1
Test 2
Test 3
Test 4
Test 5
Test 6
Test 7
Test 8
Test 9
Test 10


How often does this happen? 

Result 1. happens the most and is wrong, 2. is the correct output and does not happen as often as 1. 
And 3. happens only sometimes.

Sorry for the formatting...

Additional Information:

mono 3.12.1 on gentoo (build from source) was used for testing, but the bug was initially noticed on travis.
Comment 1 matthi.d 2015-03-24 13:11:26 UTC
Sometimes I see this as well (even more rare):

Switching to second AppDomain...
Setting up cleanup domain
Waiting for exit.
Finish work of second AppDomain
unload second AppDomain
Notice AppDomain.DomainUnload...
Waiting for Shutdown
lookup_data_table () failed for 0x3576788e1b0

Stacktrace:

  at <unknown> <0xffffffff>
lookup_data_table () failed for 0x3576788e1b0

Abgebrochen
Comment 2 Zoltan Varga 2015-03-24 13:33:33 UTC
Can't reproduce this on osx with:
Mono JIT compiler version 3.12.1 ((detached/b7764aa Fri Mar  6 15:32:47 EST 2015)
Comment 3 matthi.d 2015-03-24 16:14:32 UTC
You are right, I can see that on travis this is quite rare but it's there (maybe I can see this more often because on gentoo I use a debug build of mono?):
- https://travis-ci.org/matthid/mono-runtime-bug/builds/55689717#L3219
- https://travis-ci.org/matthid/mono-runtime-bug/builds/55691992#L1261

At least the "ThreadAbortException" part of this bug I see quite frequently on my gentoo box so if there is something I can do to help (like sending a dump), please let me know.

In the project using this I have even seen "RemotingExceptions" instead of "AppDomainUnloadedException" as well.

I will try to find a better way to reproduce but this the best I have for now.
I have noticed the following until now:
- Adding Thread.Sleep at the start of DomainUnloaded seems to improve the situation (less wrong results), see https://travis-ci.org/matthid/mono-runtime-bug/builds/55691992#L1261 but it still crashes eventually.
- Calling FriendlyName more often seems to worsen the situation (more crashes)
  https://travis-ci.org/matthid/mono-runtime-bug/builds/55695613
  https://travis-ci.org/matthid/mono-runtime-bug/builds/55698401

Can you try to reproduce this again with the second_try branch (which seems to crash more often)? (Let it run several minutes)

And just for reference my exact mono version (on my gentoo box)
mono --version
Mono JIT compiler version 3.12.1 (mono-3.12.0-branch/4cb3f77 Wed Mar 18 14:22:06 CET 2015)
Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors. www.mono-project.com
        TLS:           __thread
        SIGSEGV:       altstack
        Notifications: epoll
        Architecture:  amd64
        Disabled:      none
        Misc:          softdebug
        LLVM:          supported, not enabled.
        GC:            sgen
Comment 4 Rodrigo Kumpera 2015-04-02 11:34:42 UTC
With master and 4.0 it consistently fails with this:

Unhandled Exception:
System.NullReferenceException: Object reference not set to an instance of an object

Server stack trace: 
  at System.Runtime.Remoting.TypeInfo..ctor (System.Type type) <0x105519da0 + 0x001fb> in <filename unknown>:0 
  at System.Runtime.Remoting.ObjRef..ctor (System.String typeName, System.String uri, IChannelInfo cinfo) <0x1055a2720 + 0x000d5> in <filename unknown>:0 
  at System.Runtime.Remoting.Messaging.CADMessageBase.UnmarshalArgument (System.Object arg, System.Collections.ArrayList args, System.Type argType) <0x105562ad0 + 0x00299> in <filename unknown>:0 
  at System.Runtime.Remoting.Messaging.CADMessageBase.UnmarshalArguments (System.Object[] arguments, System.Collections.ArrayList args, System.Type[] sig) <0x1055629c0 + 0x00094> in <filename unknown>:0 
  at System.Runtime.Remoting.Messaging.CADMethodCallMessage.GetArgs (System.Collections.ArrayList args) <0x105561b30 + 0x0004d> in <filename unknown>:0 
  at System.Runtime.Remoting.Messaging.MethodCall..ctor (System.Runtime.Remoting.Messaging.CADMethodCallMessage msg) <0x105535790 + 0x00097> in <filename unknown>:0 
  at System.AppDomain.ProcessMessageInDomain (System.Byte[] arrRequest, System.Runtime.Remoting.Messaging.CADMethodCallMessage cadMsg, System.Byte[]& arrResponse, System.Runtime.Remoting.Messaging.CADMethodReturnMessage& cadMrm) <0x105535590 + 0x000b6> in <filename unknown>:0 
  at (wrapper remoting-invoke-with-check) System.AppDomain:ProcessMessageInDomain (byte[],System.Runtime.Remoting.Messaging.CADMethodCallMessage,byte[]&,System.Runtime.Remoting.Messaging.CADMethodReturnMessage&)
  at System.Runtime.Remoting.Channels.CrossAppDomainSink.ProcessMessageInDomain (System.Byte[] arrRequest, System.Runtime.Remoting.Messaging.CADMethodCallMessage cadMsg) <0x105534ec0 + 0x0006a> in <filename unknown>:0
Comment 5 Marek Safar 2015-04-08 03:39:08 UTC
Fixed the master issue and now it works as expected

witching to second AppDomain...
Setting up cleanup domain
Waiting for exit.
Finish work of second AppDomain
unload second AppDomain
Notice AppDomain.DomainUnload...
Waiting for Shutdown
AppDomain alive
AppDomainUnloadedException
AppDomain unloaded
Work finished.
Test 1
Test 2
Test 3
Test 4
Test 5
Test 6
Test 7
Test 8
Test 9
Test 10
Comment 6 matthi.d 2015-05-15 10:27:44 UTC
Is this fix contained in 4.0.1?
I still can see "1." from my initial post with 4.0.1 so I don't think this is completely fixed.

For example: https://travis-ci.org/matthid/mono-runtime-bug/builds/55698401#L2060 (note that this is run with 4.0.1)

Steps:
git clone --depth=50 --branch=second_try git://github.com/matthid/mono-runtime-bug.git
./build.sh # crashes eventually (the bug actually happens as soon as ThreadAbortException is shown in the output).

I think the crash is a follow up problem from the race condition between AppDomainUnloadException and ThreadAbortException. Once a ThreadAbortException is thrown we apparently can wait/force for the runtime to crash.
Comment 7 Marek Safar 2015-05-15 10:30:37 UTC
No, you need mono master which is 4.1