Bug 26307 - Random crash due to threading issue in new threadpool heuristics code
Summary: Random crash due to threading issue in new threadpool heuristics code
Alias: None
Product: Runtime
Classification: Mono
Component: General ()
Version: unspecified
Hardware: Macintosh Mac OS
: High normal
Target Milestone: ---
Assignee: Ludovic Henry
: 28062 ()
Depends on:
Blocks: 26790
  Show dependency tree
Reported: 2015-01-22 16:59 UTC by Martin Potter
Modified: 2015-05-27 10:10 UTC (History)
7 users (show)

Is this bug a regression?: ---
Last known good build:

Notice (2018-05-24): bugzilla.xamarin.com is now in read-only mode.

Please join us on Visual Studio Developer Community and in the Xamarin and Mono organizations on GitHub to continue tracking issues. Bugzilla will remain available for reference in read-only mode. We will continue to work on open Bugzilla bugs, copy them to the new locations as needed for follow-up, and add the new items under Related Links.

Our sincere thanks to everyone who has contributed on this bug tracker over the years. Thanks also for your understanding as we make these adjustments and improvements for the future.

Please create a new report on GitHub or Developer Community with your current version information, steps to reproduce, and relevant error messages or log files if you are hitting an issue that looks similar to this resolved bug and you do not yet see a matching new report.

Related Links:

Description Martin Potter 2015-01-22 16:59:02 UTC
There are issues with the monitor thread and new heuristics code introduced in Mono 3.12 in this commit https://github.com/mono/mono/commit/c93f12021147713216080e815267d6f7c857c732. We are currently experiencing a random crash, EXC_BAD_ACCESS, when evaluating a thread's state in the monitor_heuristic method at https://github.com/mono/mono/blob/master/mono/metadata/threadpool.c#L846. When the crash occurs, threads->len > tp->nthreads, which does not look like it should ever be the case but the difference between these two values increases the longer the program runs.

I have tracked this down to when the monitor thread creates a new threadpool thread (https://github.com/mono/mono/blob/master/mono/metadata/threadpool.c#L958). 

1: Monitor thread -> create new thread (https://github.com/mono/mono/blob/master/mono/metadata/threadpool.c#L1198)

2: New thread -> there is no work to be done, so thread removes itself from the threads array (which it hasn't been added to yet) https://github.com/mono/mono/blob/master/mono/metadata/threadpool.c#L1726-L1729 

3: Monitor thread -> adds the thread (which has now exited so the pointer is invalid) to the threads array https://github.com/mono/mono/blob/master/mono/metadata/threadpool.c#L1202

The most obvious case is when the application crashes because we no longer "own" the memory pointed to by the garbage pointer; otherwise it is evaluating junk data.

The new heuristics code also appears to create and immediately destroy a lot of threads which occurs when step 3 happens before step 2.
Comment 1 Ludovic Henry 2015-02-11 13:06:32 UTC
This issue should be fixed by https://github.com/mono/mono/commit/2c524916b74f43bc7b64e64dfbec5e1a2b90af85
Comment 2 Martin Potter 2015-02-11 13:14:07 UTC

While I agree that your commit should prevent the crash caused by adding a thread that is no longer around, I believe it fails to fix the underlying issue noted at the bottom of my initial report. The new code still creates threads just to have them be immediately destroyed which is the underlying cause of the reported crash.
Comment 3 Rodrigo Kumpera 2015-02-11 13:23:13 UTC
I agree with Martin,

We have found some workloads where this fast create/destroy cycle is quite troubling.
Comment 4 Narinder 2015-02-27 10:48:21 UTC

We have just attempted to upgrade our Mono 3.10.0 to 3.12.0 and subsequently observed regressions.

The setup is :
Debian 3.2.63-2+deb7u1 x86_64 GNU/Linux

The regressions seemed to be around our thread-based code. Using git bisect we were able to deduce the probable guilty commit :

[mono] ~/mono @ git bisect good
c93f12021147713216080e815267d6f7c857c732 is the first bad commit
commit c93f12021147713216080e815267d6f7c857c732
Author: Ludovic Henry <ludovic.henry@xamarin.com>
Date:   Tue Jul 22 12:16:45 2014 -0400

    [threadpool] Add heuristic to approach the optimal number of threads

:040000 040000 3786dcb292de059604d8638e5ffc093b57cd2ad0 c122e302b25fe6bda3fd889038ca41f68f680b42 M	mono

I suspect it is related to this issue. Next week I will try to post some more helpful details about the regression.

Comment 5 Jordan Appleson 2015-03-16 13:25:09 UTC
Is the above a fix for the following?

Comment 6 Martin Potter 2015-03-16 13:52:16 UTC

It looks like the same crash (SIGSEGV) we were getting due to invalid threads being in the list. The commit mentioned above should prevent that crash, but there is still a high churn rate in the number of threads with the new heuristics.
Comment 7 Ludovic Henry 2015-03-16 13:58:41 UTC

This crash is fixed by 2c524916b74f43bc7b64e64dfbec5e1a2b90af85. It has not be applied to 3.12, but I am currently working on getting it backported.

It does not fix the creation+suppression of useless threads though. This will be fixed with the importation of the Microsoft ThreadPool from reference source.
Comment 8 Ludovic Henry 2015-03-16 13:59:32 UTC
*** Bug 28062 has been marked as a duplicate of this bug. ***
Comment 9 Jordan Appleson 2015-03-16 14:00:34 UTC
Brilliant, I was looking to see if it was backported so I'll keep an eye it it. For now I might just compile a 3.12 hotfix with this change.
Comment 10 Jordan Appleson 2015-03-18 05:50:08 UTC
Just for reference, I compiled 3.12.1 with the 2c524916b74f43bc7b64e64dfbec5e1a2b90af85 changeset and deployed it to 8 production machines on Monday. None of them have crashed yet. #winning