Bug 25629

Summary: nunit-console leaves zombie processes in mono 3.12 vs proper behavior in prior versions (3.10)
Product: [Mono] Runtime Reporter: Jonathan Shore <jonathan.shore>
Component: GeneralAssignee: Bugzilla <bugzilla>
Status: REOPENED ---    
Severity: normal CC: mono-bugs+mono, mono-bugs+runtime, vargaz
Priority: ---    
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: Mac OS   
Tags: Is this bug a regression?: ---
Last known good build:
Attachments: test project w/ nunit issue

Description Jonathan Shore 2014-12-28 08:50:23 UTC
There is a behavioral difference between mono 3.12 (latest beta) and 3.10 in that if I run the latest production nunit-console (2.6.3.0) in a build script it runs the tests, but leaves behind TestAgency processes running at 100% cpu that never exit.  

I suspect that there is a library change related to concurrent collections, processes, or threads, whereby these processes are left spinning.

If you are not able to reproduce can try to provide more info.
Comment 1 Zoltan Varga 2015-01-02 19:07:10 UTC
Can you attach a testcase or provide reproduction instructions ?
Comment 2 Jonathan Shore 2015-01-03 09:32:39 UTC
Created attachment 9234 [details]
test project w/ nunit issue

See the README in the tarred directory.  I chopped out the near-minimal src set and 3 unit tests that demonstrate the problem.  The behavior between 3.10.x and 3.12.x on OS X is different (I have not tried on linux).

I am using a standard distribution of unit 3.6.2 (which I may have recompiled with mono to get appropriate build).
Comment 3 Zoltan Varga 2015-01-03 09:51:50 UTC
I can reproduce this with master. gdb shows the following:

It looks like the unload thread is stuck trying to lock the main domain.

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
(gdb) t apply all bt

Thread 8 (process 7928):
#0  0x9544c7af in pthread_mutex_lock ()
#1  0x002dc75d in mono_thread_destroy_tls (tls_offset=16777368) at threads.c:4006
#2  0x002dc863 in mono_thread_destroy_domain_tls (domain=0x117a280) at threads.c:4033
#3  0x0031155e in mono_domain_free (domain=0x117a280, force=0) at domain.c:1124
#4  0x0030c62d in unload_thread_main (arg=0x10c1a00) at appdomain.c:2358
#5  0x00401012 in inner_start_thread (arg=0xb051ca08) at mono-threads-posix.c:92
#6  0x954495fb in _pthread_body ()
#7  0x95449485 in _pthread_start ()
#8  0x9544ecf2 in thread_start ()

Thread 7 (process 7928):
#0  0x990e87ca in __psynch_cvwait ()
#1  0x9544bd1d in _pthread_cond_wait ()
#2  0x9544dbd9 in pthread_cond_wait$UNIX2003 ()
#3  0x003c00b6 in _wapi_handle_timedwait_signal_handle (handle=0x122, timeout=0x0, alertable=1, poll=0) at handles.c:1615
#4  0x003c017c in _wapi_handle_wait_signal_handle (handle=0x122, alertable=1) at handles.c:1560
#5  0x003de2a0 in WaitForSingleObjectEx (handle=0x122, timeout=4294967295, alertable=1) at wait.c:194
#6  0x0030c270 in mono_domain_try_unload (domain=0x117a280, exc=0xb051cba0) at appdomain.c:2476
#7  0x0030b945 in mono_domain_unload (domain=0x117a280) at appdomain.c:2385
#8  0x0030b910 in ves_icall_System_AppDomain_InternalUnload (domain_id=1) at appdomain.c:2038
#9  0x051b9094 in ?? ()
#10 0x051b8d38 in ?? ()
#11 0x051b8928 in ?? ()
#12 0x03dd00ad in ?? ()
#13 0x03a703c2 in ?? ()
#14 0x0001563d in mono_jit_runtime_invoke (method=0x19ebec0, obj=0x2b338a0, params=0xb051ceb0, exc=0x0) at mini.c:6595
#15 0x0031b120 in mono_runtime_invoke (method=0x19ebec0, obj=0x2b338a0, params=0xb051ceb0, exc=0x0) at object.c:2863
#16 0x00324482 in mono_runtime_delegate_invoke (delegate=0x2b338a0, params=0xb051ceb0, exc=0x0) at object.c:3574
#17 0x002de85e in start_wrapper_internal (data=0x10bff00) at threads.c:669
#18 0x002de4c2 in start_wrapper (data=0x10bff00) at threads.c:710
#19 0x00401012 in inner_start_thread (arg=0xbfffee18) at mono-threads-posix.c:92
#20 0x954495fb in _pthread_body ()
#21 0x95449485 in _pthread_start ()
#22 0x9544ecf2 in thread_start ()

Thread 6 (process 7928):
#0  0x990e7d2e in accept$UNIX2003 ()
#1  0x003d69bf in _wapi_accept (fd=4, addr=0x0, addrlen=0x0) at sockets.c:221
#2  0x002bbaf5 in ves_icall_System_Net_Sockets_Socket_Accept_internal (sock=4, error=0xb0318c00, blocking=1) at socket-io.c:877
#3  0x03dd0e18 in ?? ()
#4  0x03dd0384 in ?? ()
#5  0x03dd02d4 in ?? ()
#6  0x03dd0138 in ?? ()
#7  0x03dd00ad in ?? ()
#8  0x03a703c2 in ?? ()
#9  0x0001563d in mono_jit_runtime_invoke (method=0x19ebec0, obj=0x2860b38, params=0xb0318eb0, exc=0x0) at mini.c:6595
#10 0x0031b120 in mono_runtime_invoke (method=0x19ebec0, obj=0x2860b38, params=0xb0318eb0, exc=0x0) at object.c:2863
#11 0x00324482 in mono_runtime_delegate_invoke (delegate=0x2860b38, params=0xb0318eb0, exc=0x0) at object.c:3574
#12 0x002de85e in start_wrapper_internal (data=0x1169d00) at threads.c:669
#13 0x002de4c2 in start_wrapper (data=0x1169d00) at threads.c:710
#14 0x00401012 in inner_start_thread (arg=0xbfffecc8) at mono-threads-posix.c:92
#15 0x954495fb in _pthread_body ()
#16 0x95449485 in _pthread_start ()
#17 0x9544ecf2 in thread_start ()

Thread 5 (process 7928):
#0  0x990e9046 in __workq_kernreturn ()
#1  0x9544adcf in _pthread_wqthread ()
#2  0x9544ecce in start_wqthread ()

Thread 4 (process 7928):
#0  0x990e9046 in __workq_kernreturn ()
#1  0x9544adcf in _pthread_wqthread ()
#2  0x9544ecce in start_wqthread ()

Thread 3 (process 7928):
#0  0x990e9992 in kevent64 ()
#1  0x93e35899 in _dispatch_mgr_invoke ()
#2  0x93e35532 in _dispatch_mgr_thread ()

Thread 2 (process 7928):
#0  0x990e3fb6 in semaphore_wait_trap ()
#1  0x003f6fbf in mono_sem_wait (sem=0x4f667c, alertable=1) at mono-semaphore.c:103
#2  0x00316185 in finalizer_thread (unused=0x0) at gc.c:1077
#3  0x002de7c9 in start_wrapper_internal (data=0x1150cf0) at threads.c:663
#4  0x002de4c2 in start_wrapper (data=0x1150cf0) at threads.c:710
#5  0x00401012 in inner_start_thread (arg=0xbffff298) at mono-threads-posix.c:92
#6  0x954495fb in _pthread_body ()
#7  0x95449485 in _pthread_start ()
#8  0x9544ecf2 in thread_start ()

Thread 1 (process 7928):
#0  0x990e87ca in __psynch_cvwait ()
#1  0x9544bd1d in _pthread_cond_wait ()
#2  0x9544dc25 in pthread_cond_timedwait$UNIX2003 ()
#3  0x003c009c in _wapi_handle_timedwait_signal_handle (handle=0x121, timeout=0xbfffefb0, alertable=1, poll=0) at handles.c:1613
#4  0x003de2d1 in WaitForSingleObjectEx (handle=0x121, timeout=30000, alertable=1) at wait.c:196
#5  0x002d63ac in ves_icall_System_Threading_Thread_Join_internal (this=0x155c5d0, ms=30000, thread=0x121) at threads.c:1334
#6  0x051b8868 in ?? ()
#7  0x051b87c8 in ?? ()
#8  0x051b86c8 in ?? ()
#9  0x051b8574 in ?? ()
#10 0x03de4ff8 in ?? ()
#11 0x051b847a in ?? ()
#12 0x03ddbd77 in ?? ()
#13 0x006bf184 in ?? ()
#14 0x006bded4 in ?? ()
#15 0x006bdf8e in ?? ()
#16 0x0001563d in mono_jit_runtime_invoke (method=0x950cc8, obj=0x0, params=0xbffff52c, exc=0x0) at mini.c:6595
#17 0x0031b120 in mono_runtime_invoke (method=0x950cc8, obj=0x0, params=0xbffff52c, exc=0x0) at object.c:2863
#18 0x00324f62 in mono_runtime_exec_main (method=0x950cc8, args=0x2800710, exc=0x0) at object.c:4120
#19 0x00324c77 in mono_runtime_run_main (method=0x950cc8, argc=4, argv=0xbffffa88, exc=0x0) at object.c:3746
#20 0x000e690e in mono_jit_exec (domain=0x745eb0, assembly=0x383a380, argc=5, argv=0xbffffa84) at driver.c:1005
#21 0x000ea887 in main_thread_handler (user_data=0xbffff948) at driver.c:1064
#22 0x000e91e0 in mono_main (argc=6, argv=0xbffffa80) at driver.c:2024
#23 0x00002a78 in mono_main_with_options (argc=6, argv=0xbffffa80) at main.c:91
#24 0x00002670 in main (argc=6, argv=0xbffffa80) at main.c:122
(gdb) 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Comment 4 Jonathan Shore 2015-01-03 10:00:50 UTC
To make it more interesting, this only occurs if one runs the tests.NUnitLRU.TestLRU.  If instead one runs the test:

mono <path-to-nunit>/nunit-console.exe -nologo -nodots -labels -run tests.NUnitStableSort Test/bin/Debug/com.stg.core.dll

the problem does not occur.  So something about running the tests.NUnitLRU.TestLRU causes 2.12 to fail later in acquiring the lock.

Though I would mention as may make debugging easier by contrasting the situations.
Comment 5 Jonathan Shore 2015-01-03 10:11:48 UTC
One other note.  The LRUCache (which is being tested here), indirectly is using a TLS variable in the LinkedQueue implementation.  As an experiment I changed the implementation of LinkedQueue not to use an ObjectPool that was in TLS and nunit performed properly.

I suspect that the use of ThreadLocal triggers a bug in the low-level threading sub-sys.
Comment 6 Zoltan Varga 2015-01-03 13:36:30 UTC
Fixed in mono master b7c3485cdabbbef3bd744f436d8f62c3b790472d.
Comment 7 Jonathan Shore 2015-01-03 14:44:16 UTC
Thanks again for a fast investigation & resolution, Zoltan!
Comment 8 Jonathan Shore 2015-02-07 12:04:02 UTC
The same behavior as detailed above occurs on the latest distributed 2.12.x (just was pushed to the stream today).

Did the above fix make it into 2.12.x?  Or is the fix not sufficient?
Comment 9 Zoltan Varga 2015-02-07 14:59:35 UTC
The fix is in mono master, not in 3.12. It might make it into a future 3.12.x version.