Bug 38012

Summary: Using Tasks under memory pressure leads to unexpected crashes inside the TPL on iOS
Product: iOS
Reporter: T.J. Purtell <tj>
Component: Mono runtime / AOT compiler
Assignee: Ludovic Henry <ludovic>
Severity: normal
CC: kumpera, mono-bugs+monotouch, Rajneeshk
Priority: ---
Version: XI 9.4 (iOS 9.2)
Target Milestone: Untriaged
Hardware: Macintosh
OS: Mac OS
Tags:
Is this bug a regression?: ---
Last known good build:

Description T.J. Purtell 2016-01-25 20:34:39 UTC
I have been investigating a crash inside our app which seems to be related to our image loading process. Our image loading process makes heavy use of TPL + async/await. While trying to extract a useful demonstration of the crash, I started building a small sample. It turns out that the crash can be produced without loading any images at all. The only requirement seems to be memory pressure while many parallel tasks are running.

The test case does two things: (1) it runs a loop that allocates memory and updates the UI with the total memory in use, to show it's not leaking, and (2) it runs 10 parallel loops of simple tasks that yield and then return a value.

The sample code producing the error can be found here: https://bitbucket.org/tpurtell/ios-gc-task-crash . It reproduces quickly on my iOS device (9.2.1) in debug mode (regardless of whether the debugger is attached). I can't seem to reproduce it on the simulator or in release mode. It produces these two unhandled exceptions (reported as NullPointerException via the unhandled exception output or the debugger UI).

>2016-01-22 18:21:55.676 task-crash[12319:3127110] 
>Unhandled Exception:
>0   task-crash                          0x002b930f mono_handle_exception_internal + 2306
>1   task-crash                          0x002b8a07 mono_handle_exception + 30
>2   task-crash                          0x002b37d5 handle_signal_exception + 48
>3   ???                                 0x17140a08 0x0 + 387189256
>at System.Threading.Tasks.Task.FinishStageThree () [0x00045] in /Users/builder/data/lanes/2799/179ef070/source/maccore/_build/Library/Frameworks/Xamarin.iOS.framework/Versions/git/src/mono/external/referencesource/mscorlib/system/threading/Tasks/Task.cs:2387
>at System.Threading.Tasks.Task`1<T_REF>.TrySetResult (T_REF) <0x000d0>
>at System.Threading.Tasks.UnwrapPromise`1<T_REF>.TrySetFromTask (System.Threading.Tasks.Task,bool) <0x002b0>
>at System.Threading.Tasks.UnwrapPromise`1<T_REF>.ProcessInnerTask (System.Threading.Tasks.Task) <0x000a0>
>at System.Threading.Tasks.UnwrapPromise`1<T_REF>.ProcessCompletedOuterTask (System.Threading.Tasks.Task) <0x00168>
>at System.Threading.Tasks.UnwrapPromise`1<T_REF>.InvokeCore (System.Threading.Tasks.Task) <0x00048>
>at System.Threading.Tasks.UnwrapPromise`1<T_REF>.Invoke (System.Threading.Tasks.Task) <0x00050>
>at System.Threading.Tasks.Task.FinishContinuations () [0x0007c] in /Users/builder/data/lanes/2799/179ef070/source/maccore/_build/Library/Frameworks/Xamarin.iOS.framework/Versions/git/src/mono/external/referencesource/mscorlib/system/threading/Tasks/Task.cs:3661
>at System.Threading.Tasks.Task.FinishStageThree () [0x00045] in /Users/builder/data/lanes/2799/179ef070/source/maccore/_build/Library/Frameworks/Xamarin.iOS.framework/Versions/git/src/mono/external/referencesource/mscorlib/system/threading/Tasks/Task.cs:2387
>at System.Threading.Tasks.Task.FinishStageTwo () [0x00074] in /Users/builder/data/lanes/2799/179ef070/source/maccore/_build/Library/Frameworks/Xamarin.iOS.framework/Versions/git/src/mono/external/referencesource/mscorlib/system/threading/Tasks/Task.cs:2358
>at System.Threading.Tasks.Task.Finish (bool) [0x00049] in /Users/builder/data/lanes/2799/179ef070/source/maccore/_build/Library/Frameworks/Xamarin.iOS.framework/Versions/git/src/mono/external/referencesource/mscorlib/system/threading/Tasks/Task.cs:2252
>at System.Threading.Tasks.Task.ExecuteWithThreadLocal (System.Threading.Tasks.Task&) [0x00068] in /Users/builder/data/lanes/2799/179ef070/source/maccore/_build/Library/Frameworks/Xamarin.iOS.framework/Versions/git/src/mono/external/referencesource/mscorlib/system/threading/Tasks/Task.cs:2857
>at System.Threading.Tasks.Task.ExecuteEntry (bool) [0x0006f] in /Users/builder/data/lanes/2799/179ef070/source/maccore/_build/Library/Frameworks/Xamarin.iOS.framework/Versions/git/src/mono/external/referencesource/mscorlib/system/threading/Tasks/Task.cs:2781
>at System.Threading.Tasks.Task.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem () [0x00000] in /Users/builder/data/lanes/2799/179ef070/source/maccore/_build/Library/Frameworks/Xamarin.iOS.framework/Versions/git/src/mono/external/referencesource/mscorlib/system/threading/Tasks/Task.cs:2728
>at System.Threading.ThreadPoolWorkQueue.Dispatch () [0x00096] in /Users/builder/data/lanes/2799/179ef070/source/maccore/_build/Library/Frameworks/Xamarin.iOS.framework/Versions/git/src/mono/external/referencesource/mscorlib/system/threading/threadpool.cs:859
>at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback () [0x00000] in /Users/builder/data/lanes/2799/179ef070/source/maccore/_build/Library/Frameworks/Xamarin.iOS.framework/Versions/git/src/mono/external/referencesource/mscorlib/system/threading/threadpool.cs:1196
>at (wrapper runtime-invoke) object.runtime_invoke_dynamic (intptr,intptr,intptr,intptr) <0x00100>
>21  task-crash                          0x002c237f mono_jit_runtime_invoke + 1150
>22  task-crash                          0x00300cf5 mono_runtime_invoke + 88
>23  task-crash                          0x0031a653 worker_thread + 930
>24  task-crash                          0x0031e355 start_wrapper + 400
>25  task-crash                          0x0034b075 inner_start_thread + 148
>26  libsystem_pthread.dylib             0x20ac2c7f <redacted> + 138
>27  libsystem_pthread.dylib             0x20ac2bf3 _pthread_start + 110
>28  libsystem_pthread.dylib             0x20ac0a08 thread_start + 8

>2016-01-25 11:28:17.650 task-crash[15146:3593921] 
>Unhandled Exception:
>0   task-crash                          0x002ad30f mono_handle_exception_internal + 2306
>1   task-crash                          0x002aca07 mono_handle_exception + 30
>2   task-crash                          0x002a6ec3 mono_arm_throw_exception + 106
>3   task-crash                          0x0025a9b8 throw_exception + 64
>at System.Threading.Tasks.AwaitTaskContinuation.<ThrowAsyncIfNecessary>m__0 (object) [0x00000] in /Users/builder/data/lanes/2799/179ef070/source/maccore/_build/Library/Frameworks/Xamarin.iOS.framework/Versions/git/src/mono/external/referencesource/mscorlib/system/threading/Tasks/TaskContinuation.cs:885
>at System.Threading.QueueUserWorkItemCallback.WaitCallback_Context (object) [0x0000e] in /Users/builder/data/lanes/2799/179ef070/source/maccore/_build/Library/Frameworks/Xamarin.iOS.framework/Versions/git/src/mono/external/referencesource/mscorlib/system/threading/threadpool.cs:1291
>at System.Threading.ExecutionContext.RunInternal (System.Threading.ExecutionContext,System.Threading.ContextCallback,object,bool) [0x00081] in /Users/builder/data/lanes/2799/179ef070/source/maccore/_build/Library/Frameworks/Xamarin.iOS.framework/Versions/git/src/mono/external/referencesource/mscorlib/system/threading/executioncontext.cs:581
>at System.Threading.ExecutionContext.Run (System.Threading.ExecutionContext,System.Threading.ContextCallback,object,bool) [0x00000] in /Users/builder/data/lanes/2799/179ef070/source/maccore/_build/Library/Frameworks/Xamarin.iOS.framework/Versions/git/src/mono/external/referencesource/mscorlib/system/threading/executioncontext.cs:530
>at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem () [0x0002a] in /Users/builder/data/lanes/2799/179ef070/source/maccore/_build/Library/Frameworks/Xamarin.iOS.framework/Versions/git/src/mono/external/referencesource/mscorlib/system/threading/threadpool.cs:1268
>at System.Threading.ThreadPoolWorkQueue.Dispatch () [0x00096] in /Users/builder/data/lanes/2799/179ef070/source/maccore/_build/Library/Frameworks/Xamarin.iOS.framework/Versions/git/src/mono/external/referencesource/mscorlib/system/threading/threadpool.cs:859
>at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback () [0x00000] in /Users/builder/data/lanes/2799/179ef070/source/maccore/_build/Library/Frameworks/Xamarin.iOS.framework/Versions/git/src/mono/external/referencesource/mscorlib/system/threading/threadpool.cs:1196
>at (wrapper runtime-invoke) object.runtime_invoke_dynamic (intptr,intptr,intptr,intptr) <0x00100>
>12  task-crash                          0x002b637f mono_jit_runtime_invoke + 1150
>13  task-crash                          0x002f4cf5 mono_runtime_invoke + 88
>14  task-crash                          0x0030e653 worker_thread + 930
>15  task-crash                          0x00312355 start_wrapper + 400
>16  task-crash                          0x0033f075 inner_start_thread + 148
>17  libsystem_pthread.dylib             0x20ac2c7f <redacted> + 138
>18  libsystem_pthread.dylib             0x20ac2bf3 _pthread_start + 110
>19  libsystem_pthread.dylib             0x20ac0a08 thread_start + 8
Comment 1 Rodrigo Kumpera 2016-01-26 00:10:55 UTC
Hey Ludo,

We got a TPL bug here.
Comment 2 Ludovic Henry 2016-01-26 11:16:47 UTC
Hello TJ,

Thank you very much for the very detailed report and simple repro! I will work on that right away.

The second crash comes from an exception bubbling up out of the async/await code. In that case the TPL enqueues a new work item on the threadpool whose only job is to rethrow the exception, which crashes the process. This is the expected behavior for an unhandled exception on a threadpool thread. I will investigate further where this exception comes from.
Comment 3 Ludovic Henry 2016-01-29 19:11:41 UTC
Hello TJ,

The issue lies in the way the GC scans registers on ARM. We are missing some of them, which leads to an object being freed while it is still accessible. The fix is available here: https://github.com/mono/mono/pull/2540. With it, I can run for tens of thousands of collections without a crash.

I am checking with @kumpera when and where we will have a release with this fix.

Thank you again,
Comment 4 T.J. Purtell 2016-01-29 19:33:02 UTC
Interesting, thank you very much for tracking this down! :)
Comment 5 T.J. Purtell 2016-02-02 06:06:25 UTC
It definitely seems a bit weird that PC, SP, or LR would hold a GC root that isn't already scanned for other reasons. The patch seems to add a lot of roots, and perhaps this alters the timing so that it makes the problem "disappear".
Comment 6 Ludovic Henry 2016-02-02 13:36:37 UTC
The issue arose when we suspended the thread at the loop-back branch of Interlocked.Exchange. The implementation is as follows:

> 0x17220c <+92>:   dmb    sy // full memory barrier
> 0x172210 <+96>:   ldrex  lr, [r0] // lr= *r0
> 0x172214 <+100>:  strex  r12, r1, [r0] // *r0 = r1, r12 = success?
> 0x172218 <+104>:  cmp    r12, #0 // r12 == success?
> 0x17221c <+108>:  bne    0x17220c // loop back if not success - it would crash if suspended here
> ...
> 0x172228 <+120>:  mov    r1, lr // r1 = lr, we would scan r1, but it's too late.
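
For readers less familiar with ARM assembly, the retry loop above can be approximated in C. This is only a sketch under stated assumptions, not Mono's implementation: `exchange_with_retry` is a hypothetical name, and C11's weak compare-exchange stands in for the `ldrex`/`strex` pair (both are allowed to fail and force another pass through the loop).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not Mono's code: atomically store new_value into
 * *slot and return the value that was stored there before. */
static uintptr_t exchange_with_retry(_Atomic uintptr_t *slot, uintptr_t new_value)
{
    assert(slot != NULL);

    /* Plays the role of the ldrex: read the current value. */
    uintptr_t old_value = atomic_load(slot);

    /* Plays the role of strex + cmp + bne: the weak compare-exchange may
     * fail (even spuriously). On failure it reloads old_value and we retry. */
    while (!atomic_compare_exchange_weak(slot, &old_value, new_value)) {
        /* While looping, the previous value is held only in old_value,
         * which a compiler will typically keep in a register. */
    }
    return old_value;
}
```

The point of the sketch is that, for a short window, the only reference to the previous value is a local that probably lives in a register (LR in the listing above), so a thread suspended inside the loop relies on the collector scanning that register.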

The issue was that we did not scan the LR register: it is at word 15 of MonoContext, but on ARM we only scanned 14 words. We would therefore neither pin nor even mark the object that had just been exchanged. After the collection, LR would no longer point to the appropriate object (potentially garbage).
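
The effect of scanning too few words can be illustrated with a toy model. This is a sketch, not the Mono source: `toy_context` and `scan_finds_object` are hypothetical names, the context is just an array of words, and only the numbers (LR at word 15, 14 words scanned before the fix, 34 words after) are taken from this comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CTX_WORDS  34  /* words scanned after the fix, per this comment */
#define OLD_SCAN_WORDS 14  /* words scanned on ARM before the fix */
#define LR_WORD        15  /* position of LR in the context, per this comment */

/* Toy stand-in for a saved thread context; not Mono's MonoContext layout. */
typedef struct {
    uintptr_t words[TOY_CTX_WORDS];
} toy_context;

/* Conservative scan: treat the first n_words saved words as possible
 * pointers and report whether any of them refers to obj. */
static bool scan_finds_object(const toy_context *ctx, size_t n_words, const void *obj)
{
    assert(n_words <= TOY_CTX_WORDS);
    for (size_t i = 0; i < n_words; i++) {
        if (ctx->words[i] == (uintptr_t)obj)
            return true;  /* a real collector would pin or mark obj here */
    }
    return false;
}
```

With a pointer stored only at `LR_WORD`, `scan_finds_object(&ctx, OLD_SCAN_WORDS, obj)` returns false while `scan_finds_object(&ctx, TOY_CTX_WORDS, obj)` returns true, which is the difference between the object being collected and being kept alive.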

The time taken by scanning is not relevant here; what mattered was where the thread was suspended. So scanning more registers does not change the timing in any way that matters, and performance-wise we now scan 34 words instead of 14, which will not change anything substantial.
Comment 7 T.J. Purtell 2016-02-02 18:38:23 UTC
Ah, I see.  The special named register is not so special here :).  Now, I can see why this was so tricky to track down.  I am hopeful this might fix some other random crash type issues that I see from the field, but never was able to get a concrete repro case for, since it seems it would affect many things including concurrent collection types.  Excited to try it out!
Comment 8 Rajneesh Kumar 2016-03-01 14:06:28 UTC
I have checked this issue, and I am able to build and run the test sample from the bug description successfully.
[Build Info: Master]

Screencast: http://www.screencast.com/t/I8gWAmMRtKh

This issue has been fixed in master. I will re-verify it when the fix is merged into the release branch.

Device info: iPhone 4s iOS Version 9.2(13C75)
Environment Info: https://gist.github.com/Rajneesh360Logica/323f4278de7f7ea50229
Application Output: https://gist.github.com/Rajneesh360Logica/8e7ba9298b44c72314a2
Comment 9 Rajneesh Kumar 2016-03-03 13:47:41 UTC
I have checked this issue with the following build from C6SR2:

I observed that the sample works fine with this build: I am able to build and run the test sample from the Bug 38012 description successfully.

Screencast: http://www.screencast.com/t/ySisudGS
Device info: iPhone 4s iOS Version 9.2(13C75)

This issue has been fixed, so I am closing it.