Bug 43172 - Failure when alerting threads waiting on a socket that closes
Summary: Failure when alerting threads waiting on a socket that closes
Status: CONFIRMED
Alias: None
Product: Runtime
Classification: Mono
Component: io-layer
Version: master
Hardware: PC Linux
Importance: Normal normal
Target Milestone: Future Cycle (TBD)
Assignee: Katelyn Gadd
URL:
Depends on:
Blocks: 43727
Reported: 2016-08-08 17:01 UTC by Andi McClure
Modified: 2017-09-06 20:11 UTC
CC List: 8 users

See Also:
Tags:
Is this bug a regression?: ---
Last known good build:



Description Andi McClure 2016-08-08 17:01:43 UTC
MonoTests.System.Net.Sockets.SocketTest.SendAsyncFile is our biggest crash contributor in CI. It fails over 20% of the time on Linux, on all Linux flavors including Android, but never on Mac.

The failure is consistent and looks like:

MESSAGE:
System.Exception : Could not abort registered blocking threads before closing socket.
Thread StackTrace:
  at System.Net.Sockets.SafeSocketHandle.RegisterForBlockingSyscall () [0x00057] in /mnt/jenkins/workspace/test-mono-mainline-linux/label/ubuntu-1404-amd64/mcs/class/System/System.Net.Sockets/SafeSocketHandle.cs:114 
  at System.Net.Sockets.Socket.SendFile_internal (System.Net.Sockets.SafeSocketHandle safeHandle, System.String filename, System.Byte[] pre_buffer, System.Byte[] post_buffer, System.Net.Sockets.TransmitFileOptions flags) [0x00000] in /mnt/jenkins/workspace/test-mono-mainline-linux/label/ubuntu-1404-amd64/mcs/class/System/System.Net.Sockets/Socket.cs:2944 
  at System.Net.Sockets.Socket.SendFile (System.String fileName, System.Byte[] preBuffer, System.Byte[] postBuffer, System.Net.Sockets.TransmitFileOptions flags) [0x00028] in /mnt/jenkins/workspace/test-mono-mainline-linux/label/ubuntu-1404-amd64/mcs/class/System/System.Net.Sockets/Socket.cs:2893
[snip] 

Examples:

https://jenkins.mono-project.com/job/test-mono-mainline-linux/label=ubuntu-1404-amd64/556/testReport/MonoTests.System.Net.Sockets/SocketTest/SendAsyncFile/
https://jenkins.mono-project.com/job/test-mono-mainline-linux/label=ubuntu-1404-i386/558/testReport/MonoTests.System.Net.Sockets/SocketTest/SendAsyncFile/
Comment 1 Alexander Köplinger [MSFT] 2016-08-16 21:43:42 UTC
Marcos is already looking at this, so I'm moving it to him.

I'm not sure it's really worthy of the C8 milestone, since it'll only crash in CI (see https://github.com/mono/mono/blob/02f5cd35f23e89c0e8f66ba08f32bfa7f6f4ea74/mcs/class/System/System.Net.Sockets/SafeSocketHandle.cs#L28).
Comment 2 Andi McClure 2016-08-16 22:32:00 UTC
Agreed that this isn't necessary for C8 if it doesn't impact non-CI users.
Comment 3 Katelyn Gadd 2017-07-31 21:04:18 UTC
I can't reproduce this on my Ubuntu x64 setup. Does it depend on something specific about the test bots? Is it fixed?
Comment 4 Alexander Köplinger [MSFT] 2017-08-25 12:49:53 UTC
I'm disabling the test for now with https://github.com/mono/mono/pull/5447.

@Katelyn I know you're working on fixing the underlying issue with https://github.com/mono/mono/pull/5345. You'll need to revert my change in your PR when you test it :)
Comment 5 Ludovic Henry 2017-09-06 19:46:20 UTC
Katelyn, did you make progress on this bug? Could you share what you've found out so far, even if it's not fixed? Thank you.
Comment 6 Katelyn Gadd 2017-09-06 20:11:00 UTC
Currently when we close a socket we manually halt any i/o occurring against the socket, but we do it using some complex logic that registers active i/o threads and cancels them manually. There is some spin wait logic in there along with a bunch of other nuances, and some of it is platform-specific.
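
For readers unfamiliar with the pattern, the register-and-cancel close logic described above can be modelled roughly as follows. This is a minimal Python sketch, not the actual Mono code: the class name, the `cancel` callback, the retry budget and the wait interval are all hypothetical, and only the error message is taken from the failure reported in this bug.

```python
import threading

ABORT_RETRIES = 10     # hypothetical retry budget, not Mono's real value
WAIT_SECONDS = 0.02    # hypothetical pause between cancel attempts


class SketchSocketHandle:
    """Toy model of a handle that tracks workers blocked in I/O on it."""

    def __init__(self):
        # Condition() wraps a re-entrant lock by default, so a cancel
        # callback may call unregister() from inside close().
        self._cond = threading.Condition()
        self._blocking = set()   # workers currently inside a blocking call
        self.closed = False

    def register(self, worker):
        with self._cond:
            self._blocking.add(worker)

    def unregister(self, worker):
        with self._cond:
            self._blocking.discard(worker)
            self._cond.notify_all()

    def close(self, cancel):
        """Cancel every registered worker first; only then mark the handle closed."""
        with self._cond:
            attempts = 0
            while self._blocking:
                if attempts >= ABORT_RETRIES:
                    # The handle is never closed on this path.
                    raise RuntimeError(
                        "Could not abort registered blocking threads "
                        "before closing socket."
                    )
                attempts += 1
                for worker in list(self._blocking):
                    cancel(worker)           # ask the worker to stop
                self._cond.wait(WAIT_SECONDS)  # give it a moment to unregister
            self.closed = True
```

In this model, a worker that never reacts to `cancel` (as the stuck sendfile appears to do) exhausts the retry budget, so `close` raises instead of closing the handle.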

In the case of this bug there seems to be an assumption that we can correctly wake up the i/o thread when it's performing a sendfile, but that's not actually the case. When I reproduce this failure the sendfile operation stays stuck forever, and our socket close logic gives up because it wants to ensure it has cancelled all outstanding operations, and it does this *before* closing the socket. Because this is implemented by us, however, it's possible the actual kernel-level sendfile op completed and we're stuck somewhere in managed code - I was never able to catch this in action inside the debugger.

My current approach to fixing it is to close the socket before manually waking up our i/o threads, because under normal circumstances closing a socket will terminate any outstanding i/o against the socket (and the behavior here is at least somewhat documented on every OS). After the close is complete, any operations that are still pending (due to a kernel bug or otherwise - this close logic apparently exists to work around a kernel bug on one platform) will get cleaned up by our elaborate logic but it will no longer be necessary to involve it any time a socket is closed.
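
The reordering described above can be sketched as a toy model in Python. This is not the code from the PR: `os_close` and `cancel` are hypothetical callbacks standing in for the OS-level close and the manual wake-up logic, and returning the leftover workers instead of raising is an assumption made for illustration.

```python
import threading


class CloseFirstHandle:
    """Toy model: close the descriptor first, then clean up any stragglers."""

    def __init__(self, os_close, cancel):
        self._os_close = os_close   # hypothetical: closes the OS-level socket
        self._cancel = cancel       # hypothetical: manually wakes one worker
        # Condition() wraps a re-entrant lock by default, so the cancel
        # callback may call unregister() from inside close().
        self._cond = threading.Condition()
        self._blocking = set()      # workers currently inside a blocking call
        self.closed = False

    def register(self, worker):
        with self._cond:
            self._blocking.add(worker)

    def unregister(self, worker):
        with self._cond:
            self._blocking.discard(worker)
            self._cond.notify_all()

    def close(self, retries=10, wait_seconds=0.02):
        # Step 1: close the socket. In the common case this alone ends the
        # pending I/O and the workers unregister themselves.
        self._os_close()
        self.closed = True
        # Step 2: best-effort cleanup, reached only if workers are still
        # registered after the close.
        with self._cond:
            for _ in range(retries):
                if not self._blocking:
                    break
                for worker in list(self._blocking):
                    self._cancel(worker)
                self._cond.wait(wait_seconds)
            return set(self._blocking)   # workers that never unregistered
```

In this model the manual wake-up path runs only when the close itself did not release every worker, which matches the intent that the workaround no longer be involved in every socket close.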

I implemented that fix and it eliminated this issue, but unfortunately the fix revealed some race conditions and faulty assumptions in other tests - we were relying on this workaround in order to make some bad/unreliable socket code in tests behave the way we wanted, when in the real world it would produce errors or unexpected results. I'm still trying to chase down the issues the fix revealed. Fully tracking this down requires some infrastructure work to make it possible to investigate this sort of failure, because we have a lot of code in our tests that suffers from the same issues.
