MonoTests.System.Net.Sockets.SocketTest.SendAsyncFile is our biggest crash contributor in CI. It fails over 20% of the time on Linux, on all Linux flavors including Android, but never on Mac.
The failure is consistent and looks like:
System.Exception : Could not abort registered blocking threads before closing socket.
at System.Net.Sockets.SafeSocketHandle.RegisterForBlockingSyscall () [0x00057] in /mnt/jenkins/workspace/test-mono-mainline-linux/label/ubuntu-1404-amd64/mcs/class/System/System.Net.Sockets/SafeSocketHandle.cs:114
at System.Net.Sockets.Socket.SendFile_internal (System.Net.Sockets.SafeSocketHandle safeHandle, System.String filename, System.Byte pre_buffer, System.Byte post_buffer, System.Net.Sockets.TransmitFileOptions flags) [0x00000] in /mnt/jenkins/workspace/test-mono-mainline-linux/label/ubuntu-1404-amd64/mcs/class/System/System.Net.Sockets/Socket.cs:2944
at System.Net.Sockets.Socket.SendFile (System.String fileName, System.Byte preBuffer, System.Byte postBuffer, System.Net.Sockets.TransmitFileOptions flags) [0x00028] in /mnt/jenkins/workspace/test-mono-mainline-linux/label/ubuntu-1404-amd64/mcs/class/System/System.Net.Sockets/Socket.cs:2893
Marcos is already looking at this, so I'm moving it to him.
I'm not sure it's really worthy of the C8 milestone, since it will only crash in CI (see https://github.com/mono/mono/blob/02f5cd35f23e89c0e8f66ba08f32bfa7f6f4ea74/mcs/class/System/System.Net.Sockets/SafeSocketHandle.cs#L28)
Agreed that this isn't necessary for C8 if it doesn't impact non-CI users.
I can't reproduce this on my Ubuntu x64 setup. Does it depend on something specific to the test bots? Or has it already been fixed?
I'm disabling the test for now with https://github.com/mono/mono/pull/5447.
@Katelyn I know you're working on fixing the underlying issue with https://github.com/mono/mono/pull/5345, so you'll need to revert my change in your PR when you test it :)
Katelyn, did you make progress on this bug? Could you share what you've found out so far, even if it's not fixed? Thank you.
Currently when we close a socket we manually halt any i/o occurring against the socket, but we do it using some complex logic that registers active i/o threads and cancels them manually. There is some spin wait logic in there along with a bunch of other nuances, and some of it is platform-specific.
In the case of this bug there seems to be an assumption that we can correctly wake up the i/o thread when it's performing a sendfile, but that's not actually the case. When I reproduce this failure the sendfile operation stays stuck forever, and our socket close logic gives up because it wants to ensure it has cancelled all outstanding operations, and it does this *before* closing the socket. Because this is implemented by us, however, it's possible the actual kernel-level sendfile op completed and we're stuck somewhere in managed code - I was never able to catch this in action inside the debugger.
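To make the failure mode concrete, here is a rough Python simulation of the pattern described above. It is not the Mono code: `BlockingRegistry`, `abort_all`, and `run_scenario` are hypothetical names, the blocking sendfile is replaced by a wait on an event, and the attempt count and delay are arbitrary. It only illustrates how a closer that insists on cancelling every registered thread ends up giving up when the wake-up never lands.

```python
import threading
import time


class BlockingRegistry:
    """Hypothetical stand-in for the bookkeeping described above."""

    def __init__(self):
        self._lock = threading.Lock()
        self._blocked = set()

    def register(self):
        # Called by an i/o thread just before it enters a blocking call.
        with self._lock:
            self._blocked.add(threading.get_ident())

    def unregister(self):
        # Called by the same thread once the blocking call has returned.
        with self._lock:
            self._blocked.discard(threading.get_ident())

    def abort_all(self, interrupt, attempts=50, delay=0.01):
        # Bounded spin wait: poke every registered thread, then re-check.
        for _ in range(attempts):
            with self._lock:
                pending = list(self._blocked)
            if not pending:
                return True
            for thread_id in pending:
                interrupt(thread_id)
            time.sleep(delay)
        return False  # the closer gives up, as in the reported exception


def run_scenario(interrupt_works):
    registry = BlockingRegistry()
    wake = threading.Event()
    entered = threading.Event()

    def io_thread():
        registry.register()
        entered.set()
        wake.wait()  # stands in for the blocking sendfile
        registry.unregister()

    worker = threading.Thread(target=io_thread)
    worker.start()
    entered.wait()

    if interrupt_works:
        interrupt = lambda thread_id: wake.set()
    else:
        interrupt = lambda thread_id: None  # the wake-up never reaches the thread

    aborted = registry.abort_all(interrupt)
    wake.set()  # release the worker so the sketch always terminates
    worker.join()
    return aborted
```

Calling `run_scenario(False)` models the stuck sendfile: the registry never empties, so `abort_all` runs out of attempts and reports failure before the socket is ever closed.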
My current approach to fixing it is to close the socket before manually waking up our i/o threads, because under normal circumstances closing a socket will terminate any outstanding i/o against the socket (and the behavior here is at least somewhat documented on every OS). After the close is complete, any operations that are still pending (due to a kernel bug or otherwise - this close logic apparently exists to work around a kernel bug on one platform) will get cleaned up by our elaborate logic but it will no longer be necessary to involve it any time a socket is closed.
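A small Python sketch of that ordering, under loud assumptions: it uses a local `socket.socketpair()` and a blocked `recv` rather than sendfile, and it tears the connection down with `shutdown(SHUT_RDWR)` followed by `close()`, which is my substitute for whatever the runtime's close path actually does. `close_then_check` is a hypothetical helper, not anything in Mono.

```python
import socket
import threading


def close_then_check(timeout=5.0):
    left, right = socket.socketpair()
    outcome = {}

    def io_thread():
        try:
            outcome["data"] = left.recv(1024)  # blocks: nothing is ever sent
        except OSError as exc:
            outcome["error"] = exc

    worker = threading.Thread(target=io_thread, daemon=True)
    worker.start()

    # Step 1: tear the connection down first and let the OS wake the thread.
    left.shutdown(socket.SHUT_RDWR)

    # Step 2: only threads that are still stuck would need manual cleanup.
    worker.join(timeout)
    still_stuck = worker.is_alive()

    left.close()
    right.close()
    return still_stuck, outcome
```

On Linux I would expect the blocked `recv` to return as soon as the shutdown happens, so `still_stuck` is false and the manual cancellation path has nothing left to do. Any thread that did remain stuck at step 2 is the case the existing cleanup logic would still handle.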
I implemented that fix and it eliminated this issue, but unfortunately the fix revealed some race conditions and faulty assumptions in other tests - we were relying on this workaround in order to make some bad/unreliable socket code in tests behave the way we wanted, when in the real world it would produce errors or unexpected results. I'm still trying to chase down the issues the fix revealed. Fully tracking this down requires some infrastructure work to make it possible to investigate this sort of failure, because we have a lot of code in our tests that suffers from the same issues.