Bug 58032 - Race conditions of threadpool-worker-default.c
Summary: Race conditions of threadpool-worker-default.c
Alias: None
Product: Runtime
Classification: Mono
Component: General
Version: master
Hardware: PC Linux
Importance: --- normal
Target Milestone: Future Cycle (TBD)
Assignee: Ludovic Henry
Depends on:
Reported: 2017-07-08 20:06 UTC by Armin
Modified: 2017-07-10 15:27 UTC
CC: 3 users

Is this bug a regression?: ---
Last known good build:

Stack traces (29.17 KB, text/plain)
2017-07-08 20:06 UTC, Armin

Description Armin 2017-07-08 20:06:50 UTC
Created attachment 23422
Stack traces

In the course of my work with clang's ThreadSanitizer (my GSoC project), I have found three connected races within threadpool-worker-default.c (commit 57239e923f3f5942d62e058d7fe4dc01f67c0735), and I am not entirely sure whether they are all harmless. I have attached the three corresponding stack traces, and below I try to reconstruct what I think is going on.
It all starts here:

static ThreadPoolWorker worker;

static void heuristic_notify_work_completed (void) {
  // ...
  worker.heuristic_last_dequeue = mono_msec_ticks ();
  if (heuristic_should_adjust ())
    heuristic_adjust ();
}

`worker.heuristic_last_dequeue` gets updated concurrently from at least two different threads without any synchronisation. Even if this turns out to be harmless, I would at least document it briefly, because the value looks monotonic at first glance and nothing guarantees that under concurrent unsynchronised writes.
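
To make the intent explicit, the update (and the corresponding reads) could go through atomic operations. Below is a minimal sketch, using plain C11 atomics instead of Mono's own atomic helpers and a standalone variable instead of the real worker struct, just to illustrate the shape of the fix:

#include <stdatomic.h>
#include <stdint.h>

/* stands in for worker.heuristic_last_dequeue */
static _Atomic int64_t heuristic_last_dequeue;

static void
heuristic_notify_work_completed_sketch (int64_t now_msec)
{
  /* relaxed ordering is probably enough for a heuristic input,
   * and it already removes the data race that tsan reports */
  atomic_store_explicit (&heuristic_last_dequeue, now_msec, memory_order_relaxed);
}

static int64_t
read_heuristic_last_dequeue_sketch (void)
{
  return atomic_load_explicit (&heuristic_last_dequeue, memory_order_relaxed);
}

Whether relaxed ordering is good enough depends on how much the heuristic relies on the value being (roughly) monotonic, which is exactly the property I would like to see documented.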

Next, there is the call of `heuristic_should_adjust ()`:

static gboolean heuristic_should_adjust (void) {
  if (worker.heuristic_last_dequeue > worker.heuristic_last_adjustment + worker.heuristic_adjustment_interval) {
    // ...
  // ...

`worker.heuristic_last_dequeue` is read here and might just have been set to an unexpected (higher or lower) value by another thread. Interestingly, I have not yet seen a tsan warning for this particular write/read pair (I will keep looking).
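
For reference, here is a tiny standalone program (my own repro sketch, not Mono code) that shows the same unsynchronised 64-bit write/read pattern; building it with `clang -fsanitize=thread -g repro.c -o repro -lpthread` and running it should produce a tsan data-race report for this kind of access:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* stands in for worker.heuristic_last_dequeue */
static int64_t last_dequeue;

static void *writer (void *arg)
{
  for (int i = 0; i < 100000; ++i)
    last_dequeue = i;        /* unsynchronised write */
  return NULL;
}

static void *reader (void *arg)
{
  int64_t sum = 0;
  for (int i = 0; i < 100000; ++i)
    sum += last_dequeue;     /* unsynchronised read */
  printf ("%lld\n", (long long) sum);
  return NULL;
}

int main (void)
{
  pthread_t w, r;
  pthread_create (&w, NULL, writer, NULL);
  pthread_create (&r, NULL, reader, NULL);
  pthread_join (w, NULL);
  pthread_join (r, NULL);
  return 0;
}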

In addition, `worker.heuristic_last_adjustment` and `worker.heuristic_adjustment_interval` are read here and might have been (independently) modified by another thread running the last pieces of this puzzle, `heuristic_adjust ()` and `hill_climbing_update ()`:

static void heuristic_adjust (void) {
  if (mono_coop_mutex_trylock (&worker.heuristic_lock) == 0) {
    // ...
    new_thread_count = hill_climbing_update (counter._.max_working, sample_duration, completions, &worker.heuristic_adjustment_interval);
    // ...
    worker.heuristic_last_adjustment = mono_msec_ticks ();
    // ...
    mono_coop_mutex_unlock (&worker.heuristic_lock);

static gint16 hill_climbing_update (gint16 current_thread_count, guint32 sample_duration, gint32 completions, gint64 *adjustment_interval) {
  // ...
  *adjustment_interval = hc->current_sample_interval;
  // ...

`heuristic_adjust ()` takes `worker.heuristic_lock`; however, this does not prevent `heuristic_should_adjust ()` from reading partially updated values of `worker.heuristic_last_adjustment` and `worker.heuristic_adjustment_interval`, because the reader never acquires that lock.
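
One possible direction (just a sketch of the idea, not a tested patch) would be to let the reader take the same lock, so that the fields are always observed as a consistent snapshot. Only the part of the condition quoted above is reproduced here, and I am assuming mono_coop_mutex_lock () from the coop mutex API plus the usual gboolean/gint64 typedefs:

static gboolean
heuristic_should_adjust_sketch (void)
{
  gint64 last_dequeue, last_adjustment, interval;

  /* snapshot all three fields under worker.heuristic_lock */
  mono_coop_mutex_lock (&worker.heuristic_lock);
  last_dequeue = worker.heuristic_last_dequeue;
  last_adjustment = worker.heuristic_last_adjustment;
  interval = worker.heuristic_adjustment_interval;
  mono_coop_mutex_unlock (&worker.heuristic_lock);

  /* same comparison as the current code, but on a consistent snapshot */
  return last_dequeue > last_adjustment + interval;
}

Taking the lock on every dequeue might be too heavy for such a hot path, so atomic loads and stores on the individual fields (accepting that they are not updated as a unit) might be the more practical trade-off; either way, the current mix of locked writes and unlocked reads looks worth tightening.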

It looks to me as if this whole situation could use a little more synchronisation.