Bug 19906 - `socket_io_add()` interaction with `NSOpenPanel.RunModal()` causes stack overflow, EXC_BAD_ACCESS, sigsegv, Illegal instruction: 4
Alias: None
Product: Runtime
Classification: Mono
Component: JIT ()
Version: unspecified
Hardware: PC Mac OS
Importance: --- normal
Target Milestone: ---
Assignee: Bugzilla
Depends on:
Reported: 2014-05-19 15:34 UTC by Brendan Zagaeski (Xamarin Team, assistant)
Modified: 2014-07-16 22:42 UTC
CC List: 6 users

Is this bug a regression?: ---
Last known good build:

Test case (19.42 KB, application/zip)
2014-05-19 15:34 UTC, Brendan Zagaeski (Xamarin Team, assistant)
Crash logs, backtraces (68.48 KB, application/zip)
2014-05-19 15:36 UTC, Brendan Zagaeski (Xamarin Team, assistant)
Updated test case: Socket send/receive (20.14 KB, application/zip)
2014-06-06 03:32 UTC, Brendan Zagaeski (Xamarin Team, assistant)
Alternative test case: System.Diagnostics.Process (19.67 KB, application/zip)
2014-06-25 17:31 UTC, Brendan Zagaeski (Xamarin Team, assistant)
Updated test case: pthreads and semaphores (22.05 KB, application/zip)
2014-07-07 21:17 UTC, Brendan Zagaeski (Xamarin Team, assistant)

Description Brendan Zagaeski (Xamarin Team, assistant) 2014-05-19 15:34:43 UTC
Created attachment 6838
Test case

It seems that calling `WebClient.DownloadString()` starts up some threads that interfere with `NSOpenPanel.RunModal()`.

## Steps to reproduce

1. Make sure you can access "http://www.example.com". Otherwise the test app will just pause in `FinishedLaunching()`, and then eventually get killed by the OS.

2. Build and run the attached test case. This just calls `WebClient.DownloadString()`, and then presents the user with a button that runs `NSOpenPanel.RunModal()`.

3. Soon after launching the app, click the button, and then double-click a file in the new modal panel to select the file and close the panel.

4. If the app doesn't crash, repeat the process of clicking the button and selecting a file.

The crash seems a little easier to reproduce within roughly the first 10 seconds after starting the app, but this timing isn't strictly required. Choosing a file in the NSOpenPanel window is not required either; canceling the panel can be sufficient. The pacing of the UI interaction seems to matter.

## "Workarounds"

Remove `WebClient` from the app.

## Results

The app crashes. The crash can behave in a few slightly different ways.

### Launch app from lldb, breaks on EXC_BAD_ACCESS

(Attached log: lldb_EXC_BAD_ACCESS.txt)

If you launch the app using lldb [1], lldb seems to interrupt the app fairly reliably when the app hits an EXC_BAD_ACCESS.

> $ lldb TestApp_Mac/bin/Debug/TestApp_Mac.app/Contents/MacOS/TestApp_Mac
> (lldb) process launch

> Process 12845 stopped
> * thread #6: tid = 0x34c53, 0x933677e9 libsystem_pthread.dylib`_pthread_find_thread + 93, stop reason = EXC_BAD_ACCESS (code=2, address=0xb0093054)

If you "continue" a few times from lldb, you can allow some of the threads to keep working. After continuing about 10 times, I noticed a new thread. After a few more continues, the `monobt` contained the following lines:

> frame #10: 0x069482fe System.Net.Sockets.Socket:Dispose () + 0x16 (0x69482e8 0x6948311) [0x2d35e00 - TestApp_Mac.exe]
> frame #11: 0x069482de System.Net.Sockets.Socket:Close () + 0x1e (0x69482c0 0x69482e3) [0x2d35e00 - TestApp_Mac.exe]
> frame #12: 0x0697ed08
> frame #13: 0x0697ea70 System.Net.WebConnectionGroup:TryRecycle (System.TimeSpan,System.DateTime&) + 0x368 (0x697e708 0x697eac2) [0x2d35e00 - TestApp_Mac.exe]
> frame #14: 0x0697e3b4 System.Net.ServicePoint:CheckAvailableForRecycling (System.DateTime&) + 0x204 (0x697e1b0 0x697e64e) [0x2d35e00 - TestApp_Mac.exe]
> frame #15: 0x0697e1a4 System.Net.ServicePoint:IdleTimerCallback (object) + 0x24 (0x697e180 0x697e1a9) [0x2d35e00 - TestApp_Mac.exe]
> frame #16: 0x0697e13a System.Threading.Timer/Scheduler:TimerCB (object) + 0x12a (0x697e010 0x697e17f) [0x2d35e00 - TestApp_Mac.exe]

This new thread might or might not be relevant.


### Launch app from Finder, EXC_CRASH (SIGILL)

(Attached logs: Two different example stack traces are attached. `EXC_CRASH.2.crash` shows a case where the app crashed while closing the NSOpenPanel rather than while handling the button click.)

If you run the app by opening the `.app` bundle in Finder, the crash log will most likely show an "EXC_CRASH" exception. The stderr from the app (in `system.log`) shows "Illegal instruction: 4".

> Exception Type:  EXC_CRASH (SIGILL)
> Exception Codes: 0x0000000000000000, 0x0000000000000000

> Application Specific Information:
> Performing @selector(ButtonClick:) from sender NSButton 0x7854ae0

> Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
> 0   com.apple.CoreFoundation      	0x92e88124 _CFRuntimeCreateInstance + 4
> 1   com.apple.CoreFoundation      	0x92e87d35 CFBasicHashCreate + 133
> 2   com.apple.CoreFoundation      	0x92e8b31b __CFDictionaryCreateGeneric + 859
> 3   com.apple.CoreFoundation      	0x92ed6bc6 CFDictionaryCreate + 70
> 4   com.apple.CoreServices.CarbonCore	0x909920e8 RequestVolumeNotificationWithFlags + 297
> 5   com.apple.CoreServices.CarbonCore	0x90991fad RegisterForVolumeMountNotifications + 78
> 6   com.apple.CoreServices.CarbonCore	0x90991ae2 SaveToFolderCache + 441
> 7   com.apple.CoreServices.CarbonCore	0x90990463 FindFolderGuts + 1428
> 8   com.apple.CoreServices.CarbonCore	0x909908c6 ResolveRelativeFolder + 73
> 9   com.apple.CoreServices.CarbonCore	0x9099020d FindFolderGuts + 830
> 10  com.apple.CoreServices.CarbonCore	0x9098fe28 FSFindFolder + 232


### Launch app with debugging from Xamarin Studio, EXC_BAD_ACCESS

(Attached log: EXC_BAD_ACCESS.crash)

If you run the app with debugging from Xamarin Studio, the crash log is more likely to show an "EXC_BAD_ACCESS" exception. 

> Exception Type:  EXC_BAD_ACCESS (SIGILL)
> Exception Codes: KERN_PROTECTION_FAILURE at 0x00000000000fd000

> Application Specific Information:
> Performing @selector(ButtonClick:) from sender NSButton 0x5601720

> Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
> 0   libsystem_kernel.dylib        	0x929d77ca __psynch_cvwait + 10
> 1   libsystem_pthread.dylib       	0x93369d8a _pthread_cond_wait + 837
> 2   libsystem_pthread.dylib       	0x9336a042 pthread_cond_timedwait_relative_np + 47
> 3   com.apple.FinderKit           	0x9b71f9dd TConditionVariable::WaitWithTimeout(TMutex&, unsigned long long, bool&) + 107
> 4   com.apple.FinderKit           	0x9b87fdea TNodeEngineNotificationHandler::WaitForOperationCompleted(unsigned long long) + 68
> 5   com.apple.FinderKit           	0x9b880619 TNodeEngine::ProcessNotification(TNodeEngineNotificationHandler*) + 173
> 6   com.apple.FinderKit           	0x9b8803fb TNodeEngine::OpenContainer(TFENode const&, unsigned long) + 199
> 7   com.apple.FinderKit           	0x9b6bfd63 TNodeBrowser::OpenContainer(TFENode const&, unsigned long) + 121
> 8   com.apple.FinderKit           	0x9b76305d -[FI_TBrowserViewController(DataSource) openContainer:] + 192
> 9   com.apple.FinderKit           	0x9b78205d -[FI_TListViewController openContainer:] + 58
> 10  com.apple.FinderKit           	0x9b762e52 -[FI_TBrowserViewController(DataSource) openTarget] + 73
> 11  com.apple.FinderKit           	0x9b781f36 -[FI_TListViewController openTarget] + 50
> 12  com.apple.FinderKit           	0x9b88611d -[FI_TBrowserContainerController commonFinishInitialization:] + 193
> 13  com.apple.FinderKit           	0x9b8900f8 -[FIContainerController createBrowserViewWithViewStyle:containerState:] + 750
> 14  com.apple.FinderKit           	0x9b88ccf3 -[FI_TBrowserContainerController buildBrowserView:containerState:] + 395
> 15  com.apple.FinderKit           	0x9b889a85 -[FI_TBrowserContainerController setTargetPath:withViewStyle:rebuildView:] + 1372
> 16  com.apple.FinderKit           	0x9b889fa5 -[FI_TBrowserContainerController forceSetTargetPath:withViewStyle:] + 80
> 17  com.apple.FinderKit           	0x9b7fe288 -[FIFinderViewGutsController _internalSetTargetPath:withViewStyle:] + 713

### Empty stack traces, lldb automatically invoked

Sometimes if you have the Xamarin Studio debugger attached, or enable trace output (for example [2]), then when the process dies it will automatically invoke `lldb` and dump thread backtraces.

[2] MONO_ENV_OPTIONS=--trace=T:System.Net.WebClient TestApp_Mac/bin/Debug/TestApp_Mac.app/Contents/MacOS/TestApp_Mac

#### One or two stack overflow lines might or might not appear

> Stack overflow in unmanaged: IP: 0x220234c, fault addr: 0xb02174cc
> Stack overflow in unmanaged: IP: 0x220234c, fault addr: 0xb02174cc

#### The Xamarin Studio debugger might or might not break on a StackOverflowException in a background thread, with a short managed call stack

> Unhandled Exception:
> System.StackOverflowException: The requested operation caused a stack overflow.
>   at (wrapper managed-to-native) MonoMac.ObjCRuntime.Messaging:int_objc_msgSend (intptr,intptr)
>   at MonoMac.Foundation.NSObject.RetainTrampoline (IntPtr this, IntPtr sel) [0x00007] in /Users/builder/data/lanes/xamcore-lion-1.8-branch/b8b75fd4/source/xamcore/src/Foundation/NSObject.cs:297 
>   at (wrapper native-to-managed) MonoMac.Foundation.NSObject:RetainTrampoline (intptr,intptr)

#### Then lldb attaches and dumps the native stack backtraces (SIGSTOP)

(Attached log: lldb_SIGSTOP.txt)

> Stacktrace:
> Native stacktrace:
> Debug info from gdb:

> Process 11778 stopped
> * thread #1: tid = 0x379bc, 0x929d2f7a libsystem_kernel.dylib`mach_msg_trap + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP

> (lldb) * thread #1: tid = 0x379bc, 0x929d2f7a libsystem_kernel.dylib`mach_msg_trap + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>   * frame #0: 0x929d2f7a libsystem_kernel.dylib`mach_msg_trap + 10
>     frame #1: 0x929d216c libsystem_kernel.dylib`mach_msg + 68
>     frame #2: 0x92efbc09 CoreFoundation`__CFRunLoopServiceMachPort + 169
>     frame #3: 0x92efb1e1 CoreFoundation`__CFRunLoopRun + 1393
>     frame #4: 0x92efa9fa CoreFoundation`CFRunLoopRunSpecific + 394
>     frame #5: 0x92efa85b CoreFoundation`CFRunLoopRunInMode + 123

There's usually another thread that shows a `mono_handle_native_sigsegv()` call:

>   thread #4: tid = 0x379c5, 0x929d7ff2 libsystem_kernel.dylib`__wait4 + 10
>     frame #0: 0x929d7ff2 libsystem_kernel.dylib`__wait4 + 10
>     frame #1: 0x9bb44ec5 libsystem_c.dylib`waitpid$UNIX2003 + 48
>     frame #2: 0x020ab5f9 libmono-2.0.dylib`mono_handle_native_sigsegv(signal=11, ctx=0x0279dfe0) + 489 at mini-exceptions.c:2305
>     frame #3: 0x020fd9d5 libmono-2.0.dylib`mono_arch_handle_altstack_exception(sigctx=0x0279dfe0, fault_addr=0xb0093054, stack_ovf=0) + 149 at exceptions-x86.c:1170
>     frame #4: 0x02004351 libmono-2.0.dylib`mono_sigsegv_signal_handler(_dummy=10, info=0x0279dfa0, context=0x0279dfe0) + 369 at mini.c:6842
>     frame #5: 0x9a2cfdeb libsystem_platform.dylib`_sigtramp + 43
>     frame #6: 0x933677ea libsystem_pthread.dylib`_pthread_find_thread + 94
>     frame #7: 0x93367696 libsystem_pthread.dylib`_pthread_lookup_thread + 51
>     frame #8: 0x9336b96c libsystem_pthread.dylib`pthread_join$UNIX2003 + 87
>     frame #9: 0x0223e533 libmono-2.0.dylib`GC_pthread_join(thread=0xb0891000, retval=0x00000000) + 163 at pthread_support.c:1345

## The thread created by `WebClient`

It appears there is at least one thread that is present when `WebClient.DownloadString()` is called, and absent when `WebClient.DownloadString()` is _not_ called. Perhaps this thread starts up the "System.Threading.Timer/Scheduler:TimerCB" thread that appears when "continuing" in the lldb-launched scenario.

### `monobt` for the thread

> frame #0: 0x929d77ca libsystem_kernel.dylib`__psynch_cvwait + 10
> frame #1: 0x93369d1d libsystem_pthread.dylib`_pthread_cond_wait + 728
> frame #2: 0x9336bc25 libsystem_pthread.dylib`pthread_cond_timedwait$UNIX2003 + 71
> frame #3: 0x021f7924 libmono-2.0.dylib`_wapi_handle_timedwait_signal_handle [inlined] timedwait_signal_poll_cond + 27 at handles.c:1453
> frame #4: 0x021f7909 libmono-2.0.dylib`_wapi_handle_timedwait_signal_handle(handle=0x0000040a, timeout=0xb049cba0, alertable=1, poll=0) + 489 at handles.c:1543
> frame #5: 0x0220ab6a libmono-2.0.dylib`WaitForSingleObjectEx(handle=<unavailable>, timeout=99997, alertable=1) + 586 at wait.c:196
> frame #6: 0x0219edda libmono-2.0.dylib`mono_wait_uninterrupted + 122
> frame #7: 0x0219eead libmono-2.0.dylib`ves_icall_System_Threading_WaitHandle_WaitOne_internal(handle=0x0000040a, this=0x054c8eb8, ms=99997, exitContext=0) + 109 at threads.c:1438
> frame #8: 0x08c681c0 (wrapper managed-to-native) System.Threading.WaitHandle:WaitOne_internal (System.Threading.WaitHandle,intptr,int,bool) + 0x98 (0x8c68128 0x8c6821e) [0xf3e00 - TestApp_Mac.exe]
> frame #9: 0x08c67b20 System.Threading.WaitHandle:WaitOne (int,bool) + 0x128 (0x8c679f8 0x8c67bd5) [0xf3e00 - TestApp_Mac.exe]
> frame #10: 0x08c678dd System.Threading.WaitHandle:WaitOne (int) + 0x2d (0x8c678b0 0x8c678f3) [0xf3e00 - TestApp_Mac.exe]
> frame #11: 0x08c5d2e1 System.Threading.Timer/Scheduler:SchedulerThread () + 0xc41 (0x8c5c6a0 0x8c5d314) [0xf3e00 - TestApp_Mac.exe]
> frame #12: 0x08c5bc56 System.Threading.Thread:StartInternal () + 0x86 (0x8c5bbd0 0x8c5bcc2) [0xf3e00 - TestApp_Mac.exe]
> frame #13: 0x0078112d (wrapper runtime-invoke) object:runtime_invoke_void__this__ (object,intptr,intptr,intptr) + 0x11d (0x781010 0x781150) [0xf3e00 - TestApp_Mac.exe]
> frame #14: 0x0200e54c libmono-2.0.dylib`mono_jit_runtime_invoke(method=0x04cf7c64, obj=0x054c7dc8, params=0xb049cef4, exc=0x00000000) + 828 at mini.c:6727
> frame #15: 0x021cac3e libmono-2.0.dylib`mono_runtime_invoke(method=0x04cf7c64, obj=0x054c7dc8, params=0xb049cef4, exc=0x00000000) + 126 at object.c:2828
> frame #16: 0x021cadac libmono-2.0.dylib`mono_runtime_delegate_invoke(delegate=0x054c7dc8, params=0xb049cef4, exc=0x00000000) + 140 at object.c:3539
> frame #17: 0x0219c680 libmono-2.0.dylib`start_wrapper [inlined] start_wrapper_internal(data=0x08d2fb80) + 486 at threads.c:653
> frame #18: 0x0219c49a libmono-2.0.dylib`start_wrapper(data=0x08d2fb80) + 26 at threads.c:692
> frame #19: 0x0221de1d libmono-2.0.dylib`inner_start_thread(arg=0xbfffd6b0) + 253 at mono-threads-posix.c:94
> frame #20: 0x0223eebd libmono-2.0.dylib`GC_start_routine(arg=0x051481e0) + 93 at pthread_support.c:1502
> frame #21: 0x933675fb libsystem_pthread.dylib`_pthread_body + 144
> frame #22: 0x93367485 libsystem_pthread.dylib`_pthread_start + 130

## Version information
Xamarin Studio 5.0 (build 840)
Mono 3.4.0 ((no/c3fc3ba)
GTK+ 2.24.23 (Raleigh theme)
Xcode 5.1 (5084), Build 5B130a
Mac OS X 10.9.3

Also tested on Mac OS X 10.8.5.
Comment 1 Brendan Zagaeski (Xamarin Team, assistant) 2014-05-19 15:36:31 UTC
Created attachment 6839
Crash logs, backtraces
Comment 3 Chris Hamons 2014-05-20 09:22:37 UTC
I can reproduce the issue with the attached test case.

Thanks for providing it, it'll make tracking this down much easier.
Comment 4 Keith Boynton 2014-05-29 15:19:39 UTC
Do we have any progress with this please?
Comment 5 Brendan Zagaeski (Xamarin Team, assistant) 2014-05-29 17:50:34 UTC
This isn't a progress update really, but in quickly checking for possible workarounds, I've found that `WebRequest` and `HttpClient` cause the same problem too.

## `WebRequest`

> var request = (HttpWebRequest)WebRequest.Create("http://www.example.com");
> var response = request.GetResponse();

## `HttpClient`

> var httpClient = new HttpClient();
> var task = httpClient.GetStringAsync("http://www.example.com");

The results were the same with both the Mono 4.5 implementation of `HttpClient` and the Microsoft NuGet implementation [1].

> [1] https://www.nuget.org/packages/Microsoft.Net.Http/2.2.22

## `NSUrlConnection`

In contrast, `NSUrlConnection.SendAsynchronousRequest()` does not cause the problem.

> var request = new NSUrlRequest(new NSUrl("http://www.example.com"));
> NSUrlConnection.SendAsynchronousRequest(request, NSOperationQueue.MainQueue, delegate(NSUrlResponse response, NSData data, NSError error) {
> });
Comment 6 Keith Boynton 2014-05-30 12:11:04 UTC
Thank you for your response Brendan, you've been extremely helpful.

I will certainly look into your suggestion for a temporary workaround. It may not be as simple as it sounds, as I may have to recreate custom headers, cookies and other things within the requests to make it work properly.

Thank you again for your great efforts.

Comment 7 Keith Boynton 2014-06-05 08:56:52 UTC
On closer inspection, this would take a lot of effort. It would require re-architecting the solution and implementing a lot of other stuff, because the content fetching code resides in the logic layer, not in the UI layer, which is the whole point of Xamarin.

I would have to extract the content fetching class out of the logic, then implement it in a Mac specific way along with all the other request header/cookie management etc. which is not a small task.

Can we get this escalated?
Comment 8 Brendan Zagaeski (Xamarin Team, assistant) 2014-06-06 03:32:26 UTC
Created attachment 7001
Updated test case: Socket send/receive

Thanks to some nice suggestions from Martin, I was able to narrow the problem down even further.

The new attached test case uses a `Socket` directly. As far as I could tell, calling the synchronous `Socket.Send()` and `Socket.Receive()` methods did not cause a problem, but calling the asynchronous `BeginSend()` and `BeginReceive()` _did_.

Given this result, I suspect the problem is somehow caused by an interaction between:

1. Calling the `socket_pool_queue()` runtime method

2. Calling `NSOpenPanel.RunModal()`

3. Having an additional `Timer` or `Thread` running on one of the background threads for the duration of the call to `socket_pool_queue()`. Interestingly, it seems that starting this thread immediately as the app is launching increases the chances of hitting the crash compared to starting it later in response to a button click.

When running the app under the control of the Xamarin Studio debugger, the app will crash even if you comment out the call to `Thread.Sleep()`.

## EXC_BAD_ACCESS mostly on `start_wqthread()`

With this latest test app, `start_wqthread()` (libsystem_pthread.dylib) seems to be the most common location for the EXC_BAD_ACCESS.
Comment 9 Martin Baulig 2014-06-06 14:33:42 UTC
Hmm, that's really weird.

If `Timer` is causing the problems, then we might eventually be able to work around that.  For those of you not familiar with the web stack, here's what it's used for:

When you make an HTTP/1.1 request, the connection to the server is by default kept open (unless either client or server explicitly asks to close it) and will be reused for subsequent requests.  Each connection will be kept open until either the maximum number of connections has been reached or it's been idle for a certain period of time.

Both the maximum number of concurrent connections and the idle time are configurable via properties on the ServicePointManager.

What we need for the web stack is a way of

* making sure some callback is executed at a regular interval - it will check for connections that have been idle for too long and close them.

* being able to modify that interval from other threads - when the user modifies the ServicePointManager.MaxServicePointIdleTime property.

* lightweight implementation - a typical application will have 10-50 of these timers at all times while it's making web requests.

When I implemented this, I talked to a few of the runtime guys and they told me that `Timer` is a very lightweight class, so it's easier to have one timer per ServicePoint (i.e. one for each concurrent connection) rather than having a "global" one that needs to deal with different idle times.

However, we can change this if the `Timer` class turns out to be a problem and someone could suggest a different implementation.
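
As a language-neutral illustration of the idle check described above, here is a minimal sketch in C. It is not Mono's implementation (which is C# and uses `Timer`); the `conn` struct, the `recycle_idle()` name, and the millisecond timestamps are all assumptions made for the example. It only shows the work a periodic callback would do: close every open connection that has been idle longer than the configured limit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical connection record; not a Mono or .NET type. */
struct conn {
    long last_used_ms; /* time of the last request on this connection */
    int  open;         /* non-zero while the connection is kept alive */
};

/*
 * Close every open connection that has been idle for longer than
 * max_idle_ms and return how many were closed. A periodic timer
 * callback would call this with the current time.
 */
size_t recycle_idle(struct conn *conns, size_t n, long now_ms, long max_idle_ms)
{
    size_t closed = 0;

    assert(conns != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (conns[i].open && now_ms - conns[i].last_used_ms > max_idle_ms) {
            conns[i].open = 0;
            closed++;
        }
    }
    return closed;
}
```

Because `max_idle_ms` is passed in on every call, a change to the idle limit takes effect on the next tick. Whether one timer per connection group or one global timer drives the calls is the design question raised in the comment above.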
Comment 10 Brendan Zagaeski (Xamarin Team, assistant) 2014-06-25 17:31:18 UTC
Created attachment 7186
Alternative test case: System.Diagnostics.Process

I did some further testing with the original, larger app that produced this error, and I found a second, separate code path that produces the same symptoms without involving the web stack.

This new test case replaces the use of `Socket` with a use of `System.Diagnostics.Process`. The Process runs `ls` and redirects the output via `StartInfo.RedirectStandardOutput = true`.

Based on this new result, I would guess that the problem is within the `socket_io_add()` runtime method. This method is called by:

- `socket_pool_queue()`  (aka `icall_append_io_job()`)

- `System.Diagnostics.Process.ProcessAsyncReader.AddInput()` (via `mono_thread_pool_add()`)

## Steps to reproduce
(Same as steps 2-4 from the original description)

1. Build the app in the Debug configuration and launch it. It doesn't matter if you launch it via Xamarin Studio, Finder, or `lldb`.

2. Soon after launching the app, click the button, and then double-click a file in the new modal panel to select the file and close the panel.

3. If the app doesn't crash, repeat the process of clicking the button and selecting a file, or quit and re-launch the app and try again.

## Result when launched via `lldb`

> * thread #4: tid = 0x13d4d9, 0x95e82cb0 libsystem_pthread.dylib`start_wqthread, stop reason = EXC_BAD_ACCESS (code=2, address=0xb0114fec)
>  * frame #0: 0x95e82cb0 libsystem_pthread.dylib`start_wqthread

## Additional information

For this test case, I found that it was helpful (or maybe even necessary) to create 2 sleeping background threads during app initialization instead of just 1.
Comment 11 Brendan Zagaeski (Xamarin Team, assistant) 2014-07-07 21:17:48 UTC
Created attachment 7292
Updated test case: pthreads and semaphores

Here is a new, simpler sample that seems to show the same behavior. In this case, I've replaced the C# methods that invoked `socket_io_add()` with a P/Invoke to a small native library. The native library calls a few of the system functions that underlie `socket_io_init()`, namely `pthread_create()`, `pthread_attr_setstacksize()`, `semaphore_create()`, and `semaphore_timedwait()`.

## Version information

Tested on Mono and Mono 3.6.1 (fce3972).

## A little more context about the native library

The native library makes 2 calls to `pthread_create()`:
> pthread_create (&thread, &attr, function1, NULL);
> pthread_create (&thread, &attr, function2, &sem);

These calls are meant to represent the following two lines from `socket_io_init()` (threadpool.c):
> mono_thread_create_internal (mono_get_root_domain (), data->wait, data, TRUE, SMALL_STACK);
> threadpool_start_thread (&async_io_tp);

`function1()` just calls `printf()` and returns, but `function2()` keeps running `semaphore_timedwait()` in a loop:
> while (semaphore_timedwait (*sem, ts) != 0)

This loop is meant to represent the following line from `async_invoke_thread()` (threadpool.c):
> while (mono_cq_count (tp->queue) == 0 && (res = mono_sem_timedwait (&tp->new_job, 2000, TRUE)) == -1)

## The stack sizes of the threads seem to be important?

The example native library sets the thread stack size for both threads to the same `SMALL_STACK` size defined in `threadpool.c`. I have so far been unable to reproduce the crash when I set the stack size to 0 or omit the call to `pthread_attr_setstacksize()`.

I've seen various results when changing stack size, both in this small native library and when running `socket_io_init()` in a patched version of the Mono runtime. For example, one "workaround" for the `socket_io_init()` case seems to be to remove the conditional from the following line in `threadpool_start_thread()` (threadpool.c):

> stack_size = (!tp->is_io) ? 0 : SMALL_STACK;

In my tests, replacing that line with either an unconditional `stack_size = 0` or an unconditional `stack_size = SMALL_STACK` stopped the crash. Considering this result along with the new test case, it seems the problem might depend on:

1. Having interleaved calls to `pthread_create()` with different stack sizes.
2. Calling `semaphore_timedwait()` in some of those new threads.
3. Starting all of the threads in rapid succession.

## Next step: replace the `ThreadPool.QueueUserWorkItem()` calls with native calls

It's starting to look like this might be a bug or limitation in OS X's threads and semaphores. When I get a chance, I'll see if I can replace the calls to `ThreadPool.QueueUserWorkItem()` and `Thread.Sleep()` with a small number of underlying native functions. If so, I'll move on to trying to "mock up" the GC thread, monitor thread, and `async_tp` thread in an Objective-C app, and try to reproduce the crash there.
Comment 12 Zoltan Varga 2014-07-08 14:59:12 UTC
I can reproduce this using the last test case; it happens pretty randomly. Under gdb, the crash manifests itself as:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0xb0114fec
0x9326ccb0 in start_wqthread ()

(gdb) x/1i $pc
0x9326ccb0 <start_wqthread>:	push   %ebp

The crash happens because esp points to invalid memory. This function is called by the kernel, so there is no stack trace.
Comment 13 Zoltan Varga 2014-07-10 12:20:59 UTC
This is probably a regression caused by d6673ca8ec854f291eb32c048446b3868b92de7a. We use the stack size to mprotect () a part of the thread stack to be able to catch stack overflows, but if the stack size calculation is wrong, we mprotect some random memory causing these random crashes.
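
To illustrate the failure mode described above, here is a minimal C sketch. It is not the actual Mono code or the fix. It assumes a downward-growing stack whose guard page sits at `stack_top - stack_size`; `stack_limit()` and `protect_guard_page()` are hypothetical names, and the addresses in the note below are made up. If the size passed in differs from the size the thread was really created with, the computed page lies outside that thread's stack, and `mprotect()` revokes access to memory that may belong to something else.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: lowest address of a downward-growing stack. */
uintptr_t stack_limit(uintptr_t stack_top, size_t stack_size)
{
    assert(stack_size <= stack_top);
    return stack_top - stack_size;
}

/*
 * Make the page at the computed stack limit inaccessible so that a
 * stack overflow faults there. Returns 0 on success, -1 on error.
 * The result is only meaningful if stack_size matches the size the
 * thread was actually created with.
 */
int protect_guard_page(uintptr_t stack_top, size_t stack_size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    uintptr_t limit = stack_limit(stack_top, stack_size);

    limit &= ~(uintptr_t)(page - 1); /* round down to a page boundary */
    return mprotect((void *)limit, page, PROT_NONE);
}
```

For example, with a hypothetical stack top of 0xb0100000, a real stack size of 64 KB gives a limit of 0xb00f0000, while wrongly assuming 512 KB gives 0xb0080000. That is 448 KB below the real stack, in memory that could be another thread's stack.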
Comment 14 Zoltan Varga 2014-07-10 17:58:12 UTC
Committed a fix to mono master in 7892d21b5be6f1008cfa63b3943d2ae571720aa3.
Comment 15 Brendan Zagaeski (Xamarin Team, assistant) 2014-07-16 22:42:25 UTC
Thanks again for the fix!

I have now re-checked the attachments from the description, from comment 8, and from comment 10. I've also tested the full app where the problem was originally found. All of these now run without error for me on Mono (c46cf0e).

QA, feel free to mark this directly as verified, or double-check with the attachment from comment 10 if you like. Thanks!

## Steps to test

1. Launch the test case from comment 10.

2. Soon after launching the app, make sure the app's window has focus, and then press the Return key followed by the down arrow key.

3. Continue to alternate quickly between the Return key and the down arrow key. Approximately 10-15 repetitions should be enough.

My latest test case (comment 11) seems to deadlock if you repeat the repro steps quickly enough, but that is _not_ the same problem as this bug, and I think it's likely just a mistake in my code.