Bug 19665 - Race in socket file descriptor handling
Summary: Race in socket file descriptor handling
Status: NEW
Alias: None
Product: Runtime
Classification: Mono
Component: io-layer
Version: unspecified
Hardware: PC Mac OS
Importance: Normal normal
Target Milestone: ---
Assignee: Bugzilla
URL:
Depends on:
Blocks:
 
Reported: 2014-05-10 17:12 UTC by Andrea Canciani
Modified: 2016-07-11 18:49 UTC
CC: 5 users

See Also:
Tags:
Is this bug a regression?: ---
Last known good build:


Attachments
Simple test for socket blocking calls. (4.33 KB, application/octet-stream)
2014-05-15 05:26 UTC, Andrea Canciani

Description Andrea Canciani 2014-05-10 17:12:41 UTC
The branch https://github.com/ranma42/mono/tree/wip/racy-io shows the issue.
To reproduce locally, go in mcs/class/System and run:

make run-test TESTNAME=System.Net.Sockets.SocketTest.CloseAndOpenWhileReceiving

In my environment it fails with:

Runtime Environment - 
   OS Version: Unix 13.1.0.0
  CLR Version: 4.0.30319.17020 ( 3.4.1 ((no/ef5a0f9 Sat May 10 20:30:33 CEST 2014) )

Selected test: MonoTests.System.Net.Sockets.SocketTest.CloseAndOpenWhileReceiving
Excluded categories: NotOnMac,NotWorking,ValueAdd,CAS,InetAccess
.0x112e84000 socket 6
0x112e84000 closed 6
0x112e84000 socket 6
0x113ef9000 sleeping on 6...
0x112e84000 closed 6
0x112e84000 socket 6
0x113ef9000 receiving on 6...
0x113ef9000 received on 6 (-1)
0x112e84000 closed 6
F
Tests run: 1, Failures: 1, Not run: 0, Time: 2.355 seconds

Test Case Failures:
1) MonoTests.System.Net.Sockets.SocketTest.CloseAndOpenWhileReceiving : CloseAndOpenWhileReceiving wait timed out
at MonoTests.System.Net.Sockets.SocketTest.CloseAndOpenWhileReceiving () [0x00081] in /Users/ranma42/Code/sharp/mono/mcs/class/System/Test/System.Net.Sockets/SocketTest.cs:2711
at (wrapper managed-to-native) System.Reflection.MonoMethod:InternalInvoke (System.Reflection.MonoMethod,object,object[],System.Exception&)
at System.Reflection.MonoMethod.Invoke (System.Object obj, BindingFlags invokeAttr, System.Reflection.Binder binder, System.Object[] parameters, System.Globalization.CultureInfo culture) [0x00054] in /Users/ranma42/Code/sharp/mono/mcs/class/corlib/System.Reflection/MonoMethod.cs:230

The "[NOT FOR MERGING]" commit of the branch can be used to make the issue much more likely by introducing delays that could otherwise occur because of scheduling, but that would not be in general very likely/easy to reproduce.
It also adds some logging to show what is happening: basically it is possible that while a thread is getting ready to receive on a socket, another one closes it and opens a new one which gets the same file descriptor. In this case the first thread won't notice that the socket it wanted to use has been closed and will use the new one (because both were assigned the same file descriptor).
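
For illustration, a minimal standalone C sketch (mine, not from the branch) of why the descriptor gets reused: POSIX requires socket() to return the lowest available file descriptor, so a close() immediately followed by a socket() typically yields the same number.

/* fd-reuse demo: the second socket usually gets the descriptor the
 * first one just released, which is what lets the receiving thread
 * end up operating on the wrong socket. */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd1 = socket(AF_INET, SOCK_STREAM, 0);
    printf("first socket:  fd %d\n", fd1);
    close(fd1);
    int fd2 = socket(AF_INET, SOCK_STREAM, 0);
    printf("second socket: fd %d\n", fd2); /* usually equals fd1 */
    close(fd2);
    return 0;
}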
Comment 1 Andrea Canciani 2014-05-15 05:26:24 UTC
Created attachment 6815 [details]
Simple test for socket blocking calls.

Is recv() interrupted by:
1. signal/APC
2. shutdown()
3. close() / closesocket()
?

socktest.cpp shows that different OSes have different behaviour.

Linux: 1, 2
MacOSX: 1, 3
Windows: 3
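
For reference, a minimal sketch in the spirit of the attached
socktest.cpp (the attachment remains the authoritative test). It
exercises a single case: one thread blocks in recv() on a socketpair
while another calls shutdown(). Per the results above, the blocked
recv() returns (with 0) on Linux, while on MacOSX and Windows it
would keep blocking.

#include <pthread.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

static int fds[2];

static void *receiver(void *arg)
{
    char buf[1];
    ssize_t n = recv(fds[0], buf, sizeof buf, 0); /* blocks */
    printf("recv returned %zd\n", n); /* 0 after shutdown on Linux */
    return NULL;
}

int main(void)
{
    pthread_t t;
    socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
    pthread_create(&t, NULL, receiver, NULL);
    sleep(1);                    /* let the receiver block in recv() */
    shutdown(fds[0], SHUT_RDWR); /* wakes the blocked recv() on Linux */
    pthread_join(t, NULL);
    close(fds[0]);
    close(fds[1]);
    return 0;
}

(build with: cc -pthread socktest-mini.c)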

On Windows the only way to interrupt (in a clean way, i.e. without
aborting a whole thread) a recv() seems to be closesocket(). This
would cause a race because it is not possible to ensure that no other
socket is created between a closesocket() and a recv on the socket
that was just closed (save for using global synchronization).

An alternative approach is to use pipes/events (on POSIX/Win32
systems respectively) and to select()/WSAWaitForMultipleEvents() on
both the socket on which the actual operation is performed and the
object used to signal interruption.

Pseudocode for all of these approaches follows:

Socket {
int fd; // file descriptor (or SOCKET in Win32)
bool closing = false;
// A "closing" Socket is a Socket on which (managed-level) close
// has been requested. It might not yet be closed, but further operations
// on it will not be started.
int refs = 1; // one reference held by the owner, plus one per operation in progress
int aux[2]; // initialized with the two ends of a pipe
WSAEvent closed = WSACreateEvent();
}

Utility functions (can be implemented with locks or lock-free):
// atomically sets the socket as closing
// returns true if it has just become closing, false otherwise
flag_closing(Socket s)
{
  lock(s) {
    if (s.closing)
      return false;

    s.closing = true;
    return true;
  }
}

// atomically decreases the number of references to the socket
unref(Socket s)
{
  lock(s) { s.refs--; }
}

// atomically decreases the number of references to the socket; if the
// last reference is removed this way, perform the unmanaged close on
// the socket file descriptor
unref_close(Socket s)
{
  lock(s) {
    if (--s.refs == 0)
      close(s.fd);
  }
}

// atomically checks if the socket is usable and, if so, increases the number of references
// returns true if the socket is usable, false otherwise
ref(Socket s)
{
  lock(s) {
    if (!s.closing)
      s.refs++;
    return !s.closing;
  }
}
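
For concreteness, one possible lock-free implementation of these
utility functions in C11 atomics. The state encoding (bit 0 = closing
flag, upper bits = reference count, starting at one reference held by
the owner) is just one option, not a proposed patch.

#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

typedef struct {
    int fd;
    /* bit 0: closing flag; remaining bits: reference count.
     * Initialize to ONE_REF when the socket is created: one
     * reference for the owner, +1 per operation in progress. */
    atomic_int state;
} Socket;

#define CLOSING 1
#define ONE_REF 2

/* atomically set the closing flag; true iff we were the one to set it */
static bool flag_closing(Socket *s)
{
    return !(atomic_fetch_or(&s->state, CLOSING) & CLOSING);
}

/* take a reference unless the socket is closing */
static bool sock_ref(Socket *s)
{
    int old = atomic_load(&s->state);
    while (!(old & CLOSING)) {
        if (atomic_compare_exchange_weak(&s->state, &old, old + ONE_REF))
            return true;   /* not closing: operation may start */
    }
    return false;          /* closing: refuse to start the operation */
}

/* drop a reference; whoever drops the last one closes the fd */
static void sock_unref_close(Socket *s)
{
    if (atomic_fetch_sub(&s->state, ONE_REF) == (ONE_REF | CLOSING))
        close(s->fd);
}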

Win32 (event):
// assumption: all sockets are overlapped and an additional flag
//   controls their blocking/nonblocking behaviour (only showing
//   blocking behaviour here)
safe_close(Socket s)
{
  if (flag_closing(s)) {
    shutdown(s.fd);
    WSASetEvent(s.closed);
    unref_close(s);
  }
}

safe_recv(Socket s, ...)
{
  if (ref(s)) {
    overlapped_recv(s.fd, ...);
    WSAWaitForMultipleEvents([over_recv, s.closed]);
    if (event is s.closed) {
      recv aborted
    } else {
      WSAGetOverlappedResult() 
    }
    unref_close(s);
  }
}
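
In real (though untested) C, the event-based variant could look
roughly like this, assuming the Socket carries the "closed" WSAEvent
from the structure above plus the ref/unref helpers sketched earlier,
and that the SOCKET was created overlapped. Error handling is mostly
omitted.

#include <winsock2.h>

int safe_recv(Socket *s, char *buf, int len)
{
    WSABUF wbuf = { (ULONG)len, buf };
    DWORD received = 0, flags = 0;
    WSAOVERLAPPED ov = { 0 };
    WSAEVENT events[2];

    if (!sock_ref(s))
        return -1;                 /* socket already closing */

    ov.hEvent = WSACreateEvent();
    events[0] = ov.hEvent;         /* signalled on completion */
    events[1] = s->closed;         /* signalled by safe_close() */

    if (WSARecv((SOCKET)s->fd, &wbuf, 1, &received, &flags, &ov, NULL) != 0
        && WSAGetLastError() != WSA_IO_PENDING) {
        received = (DWORD)-1;      /* hard failure */
    } else if (WSAWaitForMultipleEvents(2, events, FALSE, WSA_INFINITE, FALSE)
               == WSA_WAIT_EVENT_0) {
        /* the receive completed first */
        WSAGetOverlappedResult((SOCKET)s->fd, &ov, &received, FALSE, &flags);
    } else {
        /* recv aborted by close; a real implementation must cancel the
         * still-pending receive (e.g. CancelIoEx) and wait for it
         * before ov and buf go out of scope */
        received = (DWORD)-1;
    }

    WSACloseEvent(ov.hEvent);
    sock_unref_close(s);
    return (int)received;
}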

Linux (shutdown):
safe_close(Socket s)
{
  if (flag_closing(s)) {
    shutdown(s.fd);
    unref_close(s);
  }
}

safe_recv(Socket s, ...)
{
  if (ref(s)) {
    recv(s.fd, ...);
    unref_close(s);
  }
}

POSIX (pthread_kill):
safe_close(Socket s)
{
  if (flag_closing(s)) {
    shutdown(s.fd);
    unref(s);
    while(s.refs) {
      lock(s.blocking) {
        for (t : s.blocking)
          pthread_kill(t);
      }
      pthread_yield(); // or a brief sleep()? see mono_thread_info_safe_suspend_sync()
    }
    close(s.fd);
  }
}

safe_recv(Socket s, ...)
{
  if (ref(s)) {
    s.blocking.add(this_thread);
    recv(s.fd, ...);
    s.blocking.remove(this_thread);
    unref(s);
  }
}
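
One detail worth making explicit: pthread_kill() only interrupts a
blocked recv() if a handler for the chosen signal is installed
without SA_RESTART, so that the syscall fails with EINTR instead of
being transparently restarted. A minimal sketch (the choice of
SIGUSR2 is illustrative; Mono already dedicates signals to
inter-thread communication):

#include <errno.h>
#include <signal.h>
#include <string.h>

/* no-op handler: its only job is to make blocking syscalls return EINTR */
static void interrupt_handler(int sig)
{
    (void)sig;
}

static void install_interrupt_signal(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = interrupt_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;              /* deliberately NOT SA_RESTART */
    sigaction(SIGUSR2, &sa, NULL);
}

/* in safe_recv(), the interrupted call then shows up as:
 *   recv(...) == -1 && errno == EINTR && s->closing  =>  recv aborted */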

POSIX (pipe):
// assumption: all sockets are nonblocking and an additional flag
//   controls their blocking/nonblocking behaviour (only showing
//   blocking behaviour here)
safe_close(Socket s)
{
  if (flag_closing(s)) {
    shutdown(s.fd);
    write(s.aux[1], {0}, 1);
    unref_close(s);
  }
}

safe_recv(Socket s, ...)
{
  if (ref(s)) {
    select([s.fd, s.aux[0]]);
    if (s.aux[0] is ready) {
      recv aborted
    } else {
      recv(s.fd, ...); // if recv fails and should be blocking, retry?
    }
    unref_close(s);
  }
}
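
For concreteness, an (untested) C sketch of the pipe-based variant,
assuming s->fd has O_NONBLOCK set (per the assumption above) and
reusing the Socket layout and ref/unref helpers sketched earlier,
with an aux[2] member holding the pipe ends:

#include <errno.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

ssize_t safe_recv(Socket *s, void *buf, size_t len)
{
    fd_set rfds;
    ssize_t n = -1;

    if (!sock_ref(s))
        return -1;                    /* socket already closing */

    for (;;) {
        FD_ZERO(&rfds);
        FD_SET(s->fd, &rfds);
        FD_SET(s->aux[0], &rfds);     /* written to by safe_close() */
        int maxfd = s->fd > s->aux[0] ? s->fd : s->aux[0];
        if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0)
            break;                    /* select failed: give up */
        if (FD_ISSET(s->aux[0], &rfds))
            break;                    /* recv aborted by close */
        n = recv(s->fd, buf, len, 0); /* nonblocking */
        if (n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
            break;                    /* done, or a real error */
        /* spurious readability: select() again */
    }
    sock_unref_close(s);
    return n;
}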


"shutdown" is the most efficient solution (no additional data
structures or inter-thread communication), but it only works on Linux
and it might not work for all operations (see the comment for
mono_thread_info_abort_socket_syscall_for_close()).

"pthread_kill" only works on POSIX. It has some overhead (locking, an
additional data structure), which should be negligible in the case of
single-threaded application. The overhead should only be measurable in
the cases where the naive approach would result in races.

"pipe" does not require additional data structures, but it needs two
additional file descriptors per socket. It could also be used on
Windows, replacing pipes with normal sockets.

"event" only works on Windows; it is probably more efficient than
emulating pipes with sockets and it avoid.

I believe that .NET is using a deprecated API for this purpose:
http://stackoverflow.com/questions/31871/wsacancelblockingcall-exception


I would prefer to avoid that and instead to use events on Win32 and
pthread_kill on POSIX. The rationale is as follows.

On Win32 the options are events or emulated pipes. Pipes could be more
portable, but they are probably more expensive and would require us to
reimplement [non-]blocking socket logic externally.

On POSIX pipe would be a cleaner solution (in my opinion), but it
would also be more resource intensive (each socket would consume 3
file descriptors). This cost would be paid even by single-threaded and
non-racy applications. Conversely pthread_kill would slightly increase
the cost of the socket operations (because of synchronization and the
memory needed to keep the list of blocked threads), but the
synchronization cost should be negligible in most cases (if there is
no contention) and the memory usage is easy to control (basically we
would need to augment the thread structure with a field that can be
used to track when it is blocked in a syscall). Signals are already
extensively used for inter-thread communication and pthread_kill()'ing
a list of blocked threads could be a general-purpose way to interrupt
syscalls, even those which cannot be performed in a non-blocking way
and/or select()'ed. The pthread_kill approach also has the advantage
that the close() is actually performed on the thread that initiated
the socket close, so that lingering and other blocking behaviours of
the socket occur on that thread.

Comments and suggestions (on other possible ways to tackle the issue,
on more tests to be performed, on pros/cons of the implementations, on
the way forward) are welcome. In particular, I would like to get an
ack before starting the implementation of any of the approaches.
Comment 2 Rodrigo Kumpera 2014-06-02 11:40:12 UTC
To fix this we would need to do the following:

-Refactor the io-layer handle code so it's used even on Windows;

-Change all IO paths to ensure we ref/unref around IO ops;

-Extend our cancellation code to handle using events on Windows.
Comment 3 Rogier Hofboer 2016-07-11 18:47:33 UTC
I think we are experiencing this problem or something related to it.
When sending a POST using HttpWebRequest, sometimes the data is prepended with 'garbage'. It only occurs when other threads are also opening/closing sockets at the same time. Maybe it is related to:
https://bugzilla.xamarin.com/show_bug.cgi?id=41506

Has anyone ever made a patch for this? Even if it is not finished / very raw, I am very interested.
Comment 4 Rogier Hofboer 2016-07-11 18:49:03 UTC
A note about the pseudocode above: please comment on where you think this should be added (in io-layer, in metadata, or in mcs/class).
