Bug 58633 - Xamarin.iOS Threadpool randomly halts when app is suspended and doesn't recover when resumed
Summary: Xamarin.iOS Threadpool randomly halts when app is suspended and doesn't recover when resumed
Status: CONFIRMED
Alias: None
Product: iOS
Classification: Xamarin
Component: BCL Class Libraries
Version: XI 10.10 (d15-2)
Hardware: PC Windows
Importance: --- normal
Target Milestone: Untriaged
Assignee: Ludovic Henry
URL:
Depends on:
Blocks:
 
Reported: 2017-08-08 00:18 UTC by Jahmai
Modified: 2017-11-22 16:25 UTC
CC: 5 users

See Also:
Tags:
Is this bug a regression?: ---
Last known good build:


Attachments
All log output during reproduction (6.71 MB, application/x-7z-compressed)
2017-08-08 00:21 UTC, Jahmai
Filtered log output during reproduction (2.88 MB, application/x-7z-compressed)
2017-08-08 00:22 UTC, Jahmai
Snippet log showing what happens between success / failure (28.67 KB, application/x-7z-compressed)
2017-08-08 00:33 UTC, Jahmai

Description Jahmai 2017-08-08 00:18:59 UTC
I've been witnessing this issue for months and smashing my face against a wall trying to debug it for the past few weeks.

The high-level side effect is that all TPL and thread-pool-based operations may stop functioning after a Xamarin.iOS app is 'suspended'.

In our application's case, since we use Background location, Geofencing, Background fetch and Bluetooth, there are a number of conditions where iOS can suspend our application, wake it to perform an event, and then suspend it again, which significantly increases the likelihood of reproducing this issue.

Bringing the application into the foreground doesn't repair the thread pool. In this case the main thread happily executes, as do many of the normal iOS threads, but no operations on the thread pool work (including the vast majority of TPL async/await, Task.Run, Task.Factory.StartNew, etc.). The only thing that does work is Task.Factory.StartNew(..., TaskCreationOptions.LongRunning).

I've used the following code to monitor the state continuously:

            var tplCheckTimer = NSTimer.CreateRepeatingScheduledTimer(
                TimeSpan.FromMinutes(1),
                _ =>
                {
                    ThreadPool.GetAvailableThreads(out var availableWorkerThreads, out var availablePortThreads);
                    ThreadPool.GetMinThreads(out var minWorkerThreads, out var minPortThreads);
                    ThreadPool.GetMaxThreads(out var maxWorkerThreads, out var maxPortThreads);

                    bool taskRun = Task.Run(() => { }).Wait(TimeSpan.FromSeconds(1));
                    bool taskStartNew = Task.Factory.StartNew(() => { }).Wait(TimeSpan.FromSeconds(1));
                    bool taskStartNewLongRunning = Task.Factory.StartNew(() => { }, TaskCreationOptions.LongRunning).Wait(TimeSpan.FromSeconds(1));
                    ManualResetEventSlim queueUserWorkItemResetEvent = new ManualResetEventSlim();
                    bool queueUserWorkItem = ThreadPool.QueueUserWorkItem(__ => queueUserWorkItemResetEvent.Set());
                    bool queueUserWorkItemWaited = queueUserWorkItemResetEvent.Wait(TimeSpan.FromSeconds(1));
                    ManualResetEventSlim unsafeQueueUserWorkItemResetEvent = new ManualResetEventSlim();
                    bool unsafeQueueUserWorkItem = ThreadPool.UnsafeQueueUserWorkItem(__ => unsafeQueueUserWorkItemResetEvent.Set(), null);
                    bool unsafeQueueUserWorkItemWaited = unsafeQueueUserWorkItemResetEvent.Wait(TimeSpan.FromSeconds(1));

                    Tracer.For(this).Verbose($"UnsafeQueueUserWorkItem: {unsafeQueueUserWorkItem}, UnsafeQueueUserWorkItemWaited: {unsafeQueueUserWorkItemWaited}, QueueUserWorkItem: {queueUserWorkItem}, QueueUserWorkItemWaited: {queueUserWorkItemWaited}, TaskRun: {taskRun}, TaskStartNew: {taskStartNew}, TaskStartNewLongRunning: {taskStartNewLongRunning}, HasShutdownStarted: {Environment.HasShutdownStarted}, WillTerminate: {_terminating}, Worker Threads: {minWorkerThreads}/{availableWorkerThreads}/{maxWorkerThreads}, Port Threads: {minPortThreads}/{availablePortThreads}/{maxPortThreads}");
                });

            NSRunLoop.Main.AddTimer(tplCheckTimer, NSRunLoopMode.Default);


During normal operation, we see log entries like this:

Aug  7 19:12:23 iPhone SafeZone[86676] <Notice>: 2017-08-07 09:12:23.647Z 00.0000000 * 0 Verbose UnsafeQueueUserWorkItem: True, UnsafeQueueUserWorkItemWaited: True, QueueUserWorkItem: True, QueueUserWorkItemWaited: True, TaskRun: True, TaskStartNew: True, TaskStartNewLongRunning: True, HasShutdownStarted: False, WillTerminate: False, Worker Threads: 2/200/200, Port Threads: 2/200/200

Once the issue is reproduced, we see log entries like this (and it NEVER recovers):

Aug  7 21:31:43 iPhone SafeZone[86676] <Notice>: 2017-08-07 11:31:43.654Z 00.0000000 * 0 Verbose UnsafeQueueUserWorkItem: True, UnsafeQueueUserWorkItemWaited: False, QueueUserWorkItem: True, QueueUserWorkItemWaited: False, TaskRun: False, TaskStartNew: False, TaskStartNewLongRunning: True, HasShutdownStarted: False, WillTerminate: False, Worker Threads: 2/200/200, Port Threads: 2/200/200

However, despite this knowledge, I have been completely unable to create a small reproduction program for it.

DUE TO THE SERIOUSNESS OF THIS ISSUE I BELIEVE THIS REQUIRES INVESTIGATION BY SOMEONE WITH INTIMATE KNOWLEDGE OF BOTH THE MONO THREADPOOL AND IOS APP LIFECYCLE.

I have attached the log of a full reproduction of the issue from a device with mono debug level logging of the threadpool and io-threadpool enabled.

=== Visual Studio Enterprise 2017 for Mac ===
Version 7.0.1 (build 24)
Installation UUID: f6d24073-2aea-4b5b-a922-10d46d88aae9
Runtime:
    Mono 5.0.1.1 (2017-02/5077205) (64-bit)
    GTK+ 2.24.23 (Raleigh theme)
    Package version: 500010001
=== NuGet ===
Version: 4.0.0.2323
=== .NET Core ===
Runtime: Not installed
SDK: Not installed
MSBuild SDKs: /Library/Frameworks/Mono.framework/Versions/5.0.1/lib/mono/msbuild/15.0/bin/Sdks
=== Xamarin.Profiler ===
Version: 1.5.4
Location: /Applications/Xamarin Profiler.app/Contents/MacOS/Xamarin Profiler
=== Apple Developer Tools ===
Xcode 8.3.2 (12175)
Build 8E2002
=== Xamarin.Mac ===
Version: 3.4.0.36 (Visual Studio Enterprise)
=== Xamarin.iOS ===
Version: 10.10.0.36 (Visual Studio Enterprise)
Hash: d2270eec
Branch: d15-2
Build date: 2017-05-22 16:30:53-0400
=== Xamarin.Android ===
Not Installed
=== Xamarin Inspector ===
Version: 1.2.2
Hash: b71b035
Branch: d15-1
Build date: Fri, 21 Apr 2017 17:57:12 GMT
=== Build Information ===
Release ID: 700010024
Git revision: 7ab1ca2ced6f584e56b7a0d4d321d00775cd95c9
Build date: 2017-05-19 05:44:51-04
Xamarin addins: 08d17158f3365beee5e60f67999e607cce4b3f93
Build lane: monodevelop-lion-d15-2
=== Operating System ===
Mac OS X 10.12.5
Darwin 16.6.0 Darwin Kernel Version 16.6.0
    Fri Apr 14 16:21:16 PDT 2017
    root:xnu-3789.60.24~6/RELEASE_X86_64 x86_64
Comment 1 Jahmai 2017-08-08 00:21:56 UTC
Created attachment 24072 [details]
All log output during reproduction
Comment 2 Jahmai 2017-08-08 00:22:52 UTC
Created attachment 24073 [details]
Filtered log output during reproduction
Comment 3 Jahmai 2017-08-08 00:33:20 UTC
Created attachment 24075 [details]
Snippet log showing what happens between success / failure
Comment 4 Vincent Dondain [MSFT] 2017-08-08 23:34:34 UTC
Hi Jahmai,

Thanks for all the information you provided.

Ludovic seems like the right person to help (or redirect) with regard to your issue, but I wonder if you could produce a test case so that the team can easily reproduce this bug.

Another solution would be to give us access to your project; we could make the bug private so that only Xamarin employees can access it.

This would greatly help us understand what's happening in order to fix the issue.
Comment 5 Jahmai 2017-08-09 05:20:25 UTC
@Vincent

I'm happy to try to produce a test case, but so far I haven't had any luck. I don't think it is realistic to use our app, as it depends on backend infrastructure and specific system configuration to create the circumstances for the issue to happen.

I'd prefer to engage with the developer who will be working on this to come up with the test case, though, so that they at least appreciate what is happening in theory.

What I want to avoid is predicating any investigation of this issue on some concise test case that I may not be able to produce.

For example, reproducing the issue in our own app requires running another Swift application from Xcode and using a GPX file to simulate traversing a geo-path that causes geofencing events to keep firing inside our application. Reproduction can take from 4-18 hours (since you need to allow up to 20 minutes after leaving a geofence for the app to be suspended, plus an additional 5-15 minutes for the geofence entry to be triggered to wake the app up, and it only seems to happen after the app has been woken and then has had time to get suspended again).
Comment 6 Jahmai 2017-08-09 06:11:47 UTC
I forgot to mention: worst of all, the issue seems unreproducible with the debugger attached.
Comment 7 Ludovic Henry 2017-08-10 08:02:37 UTC
Hello,

Could you please run with 'MONO_LOG_LEVEL=debug' and 'MONO_LOG_MASK=threadpool' environment variables defined on device? That would give us a good idea of what's happening inside the threadpool.
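
For a device build, one way to set these (project settings vary, so treat this as an example rather than the only option) is to pass them through the additional mtouch arguments in the iOS project's build options:

    --setenv=MONO_LOG_LEVEL=debug --setenv=MONO_LOG_MASK=threadpool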

Thank you,
Comment 8 Jahmai 2017-08-10 09:03:22 UTC
@Ludovic,

Way ahead of you, bud. If you look carefully you'll note that I already attached logs with those environment variables set.

Cheers.
Comment 9 Jahmai 2017-08-11 06:48:12 UTC
@Ludovic, Just wanna make sure you got my last comment :)
Comment 10 Ludovic Henry 2017-08-14 13:27:26 UTC
I saw your previous comment. It will take me some time to go through the log, so I will keep you posted as soon as I need more information. Thank you.
Comment 11 Jahmai 2017-09-05 07:05:34 UTC
@Ludovic

It's been a few weeks, have you read the logs yet?
Comment 12 Jahmai 2017-09-12 00:37:40 UTC
This issue is a disaster; here is what I've had to do to work around the problem (still longevity-testing this, too):

- Discover SmartThreadPool.dll, a fully managed .NET thread pool, then fork it and compile it as netstandard2
- Write our own System.Threading.Timer implementation, because the default Timer uses ThreadPool.QueueUserWorkItem
- Write a TaskEx.Delay to replace Task.Delay, which uses our injected custom Timer to avoid ThreadPool.QueueUserWorkItem
- Write a SmartThreadPoolTaskScheduler that sits on a SmartThreadPool instead of the default ThreadPool.QueueUserWorkItem (a rough sketch of this kind of scheduler follows this list)
- Inject the new task scheduler into TaskScheduler.Default using a reflection hack, so that all uses of Task.Run and async/await go to the SmartThreadPool instead of ThreadPool.QueueUserWorkItem (I wasn't going to manually update every single async call to use a custom scheduler)
- Force System.IO.Stream calls to be Task.Run(() => stream.Read(...)) instead of the default I/O-completion-based async calls (e.g. on NetworkStream / FileStream), since the I/O completion threads in the native thread pool are also affected by this issue
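
For illustration, here is a minimal sketch of the kind of dedicated-thread TaskScheduler described above (class name and thread count are illustrative; our real implementation sits on SmartThreadPool):

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading;
    using System.Threading.Tasks;

    // Runs queued tasks on its own long-lived threads so that nothing touches the
    // runtime thread pool and no thread ever has to be spun up on demand.
    public sealed class DedicatedThreadTaskScheduler : TaskScheduler, IDisposable
    {
        private readonly BlockingCollection<Task> _queue = new BlockingCollection<Task>();
        private readonly Thread[] _threads;

        public DedicatedThreadTaskScheduler(int threadCount)
        {
            _threads = new Thread[threadCount];
            for (int i = 0; i < threadCount; i++)
            {
                _threads[i] = new Thread(() =>
                {
                    // Block on the queue forever; threads are created once and never retired.
                    foreach (var task in _queue.GetConsumingEnumerable())
                        TryExecuteTask(task);
                })
                { IsBackground = true, Name = "DedicatedPool #" + i };
                _threads[i].Start();
            }
        }

        public override int MaximumConcurrencyLevel => _threads.Length;

        protected override void QueueTask(Task task) => _queue.Add(task);

        protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
            // Only inline work when we are already on one of our own threads.
            => Array.IndexOf(_threads, Thread.CurrentThread) >= 0 && TryExecuteTask(task);

        protected override IEnumerable<Task> GetScheduledTasks() => _queue.ToArray();

        public void Dispose() => _queue.CompleteAdding();
    }

Work gets pushed at it explicitly with Task.Factory.StartNew(work, CancellationToken.None, TaskCreationOptions.None, scheduler); making it the TaskScheduler.Default is the reflection hack mentioned above (the private field name differs between runtimes, so it isn't shown here).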

I think this is the most vulgar issue I've ever had to work around, edging in front of even the absurdity that https://github.com/dotnet/corefx/issues/11100 caused.
Comment 13 Jahmai 2017-09-25 08:44:53 UTC
OK, after longevity testing I was able to reproduce the issue even when using the new thread pool; it just took longer.

After much more aggressive and targeted testing, this appears to be related to threads in a pool being spun down and then spun back up.

I am now trying a much simpler approach and will report back after further testing:

            // Don't allow the threadpool to shrink or grow at all
            ThreadPool.SetMinThreads(25, 25);
            ThreadPool.SetMaxThreads(25, 25);
Comment 14 Jahmai 2017-09-27 12:34:43 UTC
The workaround in my previous comment appears to be working. Still stable after about 53 hours, which is 17 hours longer than my previous best.

So that means the problem is related to threads somehow failing to start on demand when there is work to do, but threads already created continue to run happily.
Comment 15 Jahmai 2017-09-30 11:29:55 UTC
To my great dismay, the above change wasn't enough. Now trying all hacks combined.
Comment 16 Jahmai 2017-10-02 00:21:53 UTC
Quick follow up. This code:

            ThreadPool.SetMinThreads(25, 25);
            ThreadPool.SetMaxThreads(25, 25);

was NOT actually setting the thread counts correctly (verified via GetMinThreads/GetMaxThreads), but reversing the lines works; this seems like a bug:

            ThreadPool.SetMaxThreads(25, 25);
            ThreadPool.SetMinThreads(25, 25);
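
For what it's worth, checking the return values makes the failure visible (this snippet is illustrative, not lifted from our app), since SetMinThreads/SetMaxThreads return false rather than throwing when a request is rejected:

            bool maxOk = ThreadPool.SetMaxThreads(25, 25);
            bool minOk = ThreadPool.SetMinThreads(25, 25);
            ThreadPool.GetMinThreads(out var minWorkers, out var minPorts);
            ThreadPool.GetMaxThreads(out var maxWorkers, out var maxPorts);
            Console.WriteLine($"SetMaxThreads: {maxOk}, SetMinThreads: {minOk}, Min: {minWorkers}/{minPorts}, Max: {maxWorkers}/{maxPorts}");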

But it didn't matter, because the thread pool still starts and stops threads anyway, as observed by these entries while the application is running:

Thread finished: <Thread Pool> #76
Thread finished: <Thread Pool> #73
Thread started: <Thread Pool> #75
Thread started: <Thread Pool> #76
Thread started: <Thread Pool> #77
Thread finished: <Thread Pool> #74
Thread finished: <Thread Pool> #77
Thread finished: <Thread Pool> #75
Thread started: <Thread Pool> #78
Thread started: <Thread Pool> #74

The second thing I noticed is that the custom thread pool I was using doesn't set a SynchronizationContext on each thread it starts, so awaits don't send/post their continuations back to the owning thread pool. I've now made that change and am testing again.
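
For illustration, a minimal sketch of such a context (names are illustrative): it routes Post/Send back onto the custom scheduler so that await continuations stay on that pool, and each worker thread installs it once at startup via SynchronizationContext.SetSynchronizationContext.

    using System.Threading;
    using System.Threading.Tasks;

    // Routes callbacks (e.g. await continuations) back to a specific TaskScheduler.
    public sealed class SchedulerSynchronizationContext : SynchronizationContext
    {
        private readonly TaskScheduler _scheduler;

        public SchedulerSynchronizationContext(TaskScheduler scheduler)
        {
            _scheduler = scheduler;
        }

        public override void Post(SendOrPostCallback d, object state) =>
            Task.Factory.StartNew(() => d(state), CancellationToken.None,
                                  TaskCreationOptions.None, _scheduler);

        public override void Send(SendOrPostCallback d, object state) =>
            Task.Factory.StartNew(() => d(state), CancellationToken.None,
                                  TaskCreationOptions.None, _scheduler).Wait();

        public override SynchronizationContext CreateCopy() =>
            new SchedulerSynchronizationContext(_scheduler);
    }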
Comment 17 Jahmai 2017-10-04 05:57:59 UTC
More things to avoid (an alternative for Task.Yield is sketched after this list):

Task.Yield()
TaskCreationOptions.RunContinuationsAsynchronously
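
Both of these queue their continuations through the default scheduler / runtime thread pool, which is presumably why they hit the same stall. An illustrative replacement for await Task.Yield() (customScheduler is assumed to be something like the dedicated scheduler sketched earlier):

    using System.Threading;
    using System.Threading.Tasks;

    static class PoolYield
    {
        // Explicitly hop onto the custom scheduler instead of the runtime thread pool;
        // usage: await PoolYield.YieldToCustomPoolAsync(customScheduler);
        public static Task YieldToCustomPoolAsync(TaskScheduler customScheduler) =>
            Task.Factory.StartNew(() => { }, CancellationToken.None,
                                  TaskCreationOptions.None, customScheduler);
    }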
Comment 18 Jahmai 2017-10-15 00:54:32 UTC
I've successfully worked around this issue in our applications by avoiding the Mono thread pool everywhere that matters. Best of luck fixing this for other users.
Comment 19 Alex Soto [MSFT] 2017-11-22 16:25:45 UTC
Thank you for providing all the information, Jahmai.
@Ludovic, have you had a chance to look into this?
