Bug 43172 - Failure when alerting threads waiting on a socket that closes
Summary: Failure when alerting threads waiting on a socket that closes
Alias: None
Product: Runtime
Classification: Mono
Component: io-layer
Version: master
Hardware: PC Linux
Importance: Normal normal
Target Milestone: Future Cycle (TBD)
Assignee: Katelyn Gadd
Depends on:
Blocks: 43727
Reported: 2016-08-08 17:01 UTC by Andi McClure
Modified: 2017-09-06 20:11 UTC

Is this bug a regression?: ---
Last known good build:

Description Andi McClure 2016-08-08 17:01:43 UTC
MonoTests.System.Net.Sockets.SocketTest.SendAsyncFile is our biggest crash contributor in CI. It fails over 20% of the time on Linux, on all Linux flavors including Android, but never on Mac.

The failure is consistent and looks like:

System.Exception : Could not abort registered blocking threads before closing socket.
Thread StackTrace:
  at System.Net.Sockets.SafeSocketHandle.RegisterForBlockingSyscall () [0x00057] in /mnt/jenkins/workspace/test-mono-mainline-linux/label/ubuntu-1404-amd64/mcs/class/System/System.Net.Sockets/SafeSocketHandle.cs:114 
  at System.Net.Sockets.Socket.SendFile_internal (System.Net.Sockets.SafeSocketHandle safeHandle, System.String filename, System.Byte[] pre_buffer, System.Byte[] post_buffer, System.Net.Sockets.TransmitFileOptions flags) [0x00000] in /mnt/jenkins/workspace/test-mono-mainline-linux/label/ubuntu-1404-amd64/mcs/class/System/System.Net.Sockets/Socket.cs:2944 
  at System.Net.Sockets.Socket.SendFile (System.String fileName, System.Byte[] preBuffer, System.Byte[] postBuffer, System.Net.Sockets.TransmitFileOptions flags) [0x00028] in /mnt/jenkins/workspace/test-mono-mainline-linux/label/ubuntu-1404-amd64/mcs/class/System/System.Net.Sockets/Socket.cs:2893


Comment 1 Alexander Köplinger [MSFT] 2016-08-16 21:43:42 UTC
Marcos is already looking at this, moving to him.

I'm not sure it's really worthy of the C8 milestone, since it'll only crash in CI (see https://github.com/mono/mono/blob/02f5cd35f23e89c0e8f66ba08f32bfa7f6f4ea74/mcs/class/System/System.Net.Sockets/SafeSocketHandle.cs#L28).
Comment 2 Andi McClure 2016-08-16 22:32:00 UTC
Agreed that this isn't necessary for C8 if it doesn't impact non-CI users.
Comment 3 Katelyn Gadd 2017-07-31 21:04:18 UTC
I can't reproduce this on my Ubuntu x64 setup. Does it depend on something specific about the test bots? Is it fixed?
Comment 4 Alexander Köplinger [MSFT] 2017-08-25 12:49:53 UTC
I'm disabling the test for now with https://github.com/mono/mono/pull/5447.

@Katelyn I know you're working on fixing the underlying issue with https://github.com/mono/mono/pull/5345, you'll need to revert my change in your PR when you test it :)
Comment 5 Ludovic Henry 2017-09-06 19:46:20 UTC
Katelyn, did you make progress on this bug? Could you share what you have found out so far, even if it's not fixed? Thank you.
Comment 6 Katelyn Gadd 2017-09-06 20:11:00 UTC
Currently when we close a socket we manually halt any i/o occurring against the socket, but we do it using some complex logic that registers active i/o threads and cancels them manually. There is some spin wait logic in there along with a bunch of other nuances, and some of it is platform-specific.

In the case of this bug there seems to be an assumption that we can correctly wake up the i/o thread when it's performing a sendfile, but that's not actually the case. When I reproduce this failure the sendfile operation stays stuck forever, and our socket close logic gives up because it wants to ensure it has cancelled all outstanding operations, and it does this *before* closing the socket. Because this is implemented by us, however, it's possible the actual kernel-level sendfile op completed and we're stuck somewhere in managed code - I was never able to catch this in action inside the debugger.
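
For readers unfamiliar with the pattern, the register-then-abort scheme described above, and the way it fails when one thread never wakes, can be sketched roughly as follows. This is an illustrative Python model, not Mono's code: `BlockingRegistry`, `run_worker`, the timeout values, and the use of events in place of real thread interruption are all assumptions made for the example. Only the exception message is taken from the report.

```python
import threading
import time


class BlockingRegistry:
    """Hypothetical stand-in for the per-socket list of threads blocked in I/O."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cancel_events = {}  # thread ident -> Event used to request cancellation

    def register(self):
        """Called by an I/O thread just before it blocks; returns its cancel event."""
        event = threading.Event()
        with self._lock:
            self._cancel_events[threading.get_ident()] = event
        return event

    def unregister(self):
        """Called by an I/O thread once its blocking call has returned."""
        with self._lock:
            self._cancel_events.pop(threading.get_ident(), None)

    def abort_all(self, timeout=0.5, poll_interval=0.01):
        """Signal every registered thread, then spin-wait until none remain."""
        deadline = time.monotonic() + timeout
        while True:
            with self._lock:
                pending = list(self._cancel_events.values())
            if not pending:
                return
            if time.monotonic() >= deadline:
                raise RuntimeError(
                    "Could not abort registered blocking threads before closing socket."
                )
            for event in pending:
                event.set()
            time.sleep(poll_interval)


def run_worker(registry, honours_cancel, release):
    """Start a thread that simulates a blocked I/O call and wait until it is registered."""
    started = threading.Event()

    def body():
        cancel = registry.register()
        started.set()
        try:
            if honours_cancel:
                cancel.wait()   # wakes up as soon as abort_all() signals it
            else:
                release.wait()  # models an operation that never notices the signal
        finally:
            registry.unregister()

    thread = threading.Thread(target=body)
    thread.start()
    started.wait()
    return thread
```

With a worker that honours the cancel signal, `abort_all()` returns quickly. With one that ignores it, the spin-wait runs out and the close path gives up with the same message seen in the CI failure.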

My current approach to fixing it is to close the socket before manually waking up our i/o threads, because under normal circumstances closing a socket will terminate any outstanding i/o against the socket (and the behavior here is at least somewhat documented on every OS). After the close is complete, any operations that are still pending (due to a kernel bug or otherwise - this close logic apparently exists to work around a kernel bug on one platform) will get cleaned up by our elaborate logic but it will no longer be necessary to involve it any time a socket is closed.
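
The reordering described above might look like the following in outline. This is a hedged Python sketch rather than the actual patch: `close_then_reap` and `demo` are hypothetical names, and the sketch calls shutdown() before close() to wake the blocked reader, which is an assumption of the example and not something stated in this comment.

```python
import socket
import threading


def close_then_reap(sock, blocked_threads, join_timeout=2.0):
    """Tear the socket down first, then report which I/O threads are still stuck.

    shutdown() is used here to wake blocked readers. That choice is an
    assumption of this sketch: a bare close() from another thread is not
    guaranteed to interrupt a blocked recv() on Linux.
    """
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        pass  # already disconnected; nothing left to wake
    sock.close()

    stuck = []
    for thread in blocked_threads:
        thread.join(join_timeout)
        if thread.is_alive():
            stuck.append(thread)  # only these would still need the manual abort path
    return stuck


def demo():
    """Block a reader on one end of a socket pair, then close that end."""
    ours, peer = socket.socketpair()
    results = []
    ready = threading.Event()

    def reader():
        ready.set()
        try:
            results.append(ours.recv(1024))  # blocks: the peer never sends anything
        except OSError as exc:
            results.append(exc)  # the socket was already closed when recv() ran

    thread = threading.Thread(target=reader)
    thread.start()
    ready.wait()
    stuck = close_then_reap(ours, [thread])
    peer.close()
    return stuck, results
```

In `demo()` the reader thread returns once the socket is torn down, so the list of stuck threads comes back empty and the manual abort path would have nothing left to do.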

I implemented that fix and it eliminated this issue, but unfortunately the fix revealed some race conditions and faulty assumptions in other tests - we were relying on this workaround in order to make some bad/unreliable socket code in tests behave the way we wanted, when in the real world it would produce errors or unexpected results. I'm still trying to chase down the issues the fix revealed. Fully tracking this down requires some infrastructure work to make it possible to investigate this sort of failure, because we have a lot of code in our tests that suffers from the same issues.