Bug 33286 - Sockets not properly returning from unmanaged threadpool
Summary: Sockets not properly returning from unmanaged threadpool
Status: NEW
Alias: None
Product: Runtime
Classification: Mono
Component: io-layer
Version: 4.2.0 (C6)
Hardware: PC Linux
Importance: --- normal
Target Milestone: ---
Assignee: Bugzilla
URL:
Depends on:
Blocks:
 
Reported: 2015-08-20 21:34 UTC by Jim Borden
Modified: 2015-08-27 21:36 UTC

See Also:
Tags:
Is this bug a regression?: ---
Last known good build:


Attachments
The sync gateway binary needed to run the unit test (59 bytes, text/plain)
2015-08-20 21:38 UTC, Jim Borden

Description Jim Borden 2015-08-20 21:34:46 UTC
I tagged this as 4.2.0, but it affects every version I have tried (3.2.8, 3.12.1, 4.0.3-20, 4.2, and master).  This problem happens ONLY on Linux (both x86 and x64, on Debian 8 and Ubuntu 15); Xamarin.iOS, Xamarin.Android, Windows .NET 3.5 / 4.5, and OS X .NET 3.5 / 4.5 do not exhibit this behavior.

**BACKGROUND**

I am developing a product that uses several concurrent HttpClient objects, especially in unit tests.  The unit test HttpClients have no special setup, but the library HttpClient objects are set up with two layers of message handling (one to process HTTP 401 responses, and the other to retry requests after transient network errors).  For what it's worth, I've tried combining some of these objects into one, but that didn't entirely resolve the problem.  A replication in this library consists of several HTTP requests and responses (GET, DELETE, PUT, and POST are used).
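For readers unfamiliar with layered message handling, the two-layer setup described above might look roughly like the following. This is only a sketch under assumptions: the class names (UnauthorizedHandler, RetryHandler, HandlerChain), the attempt limit, and the backoff are hypothetical and are not taken from the Couchbase Lite source; only DelegatingHandler, HttpClient, and the other System.Net.Http types are real framework APIs.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical outer layer: inspects responses for HTTP 401.
sealed class UnauthorizedHandler : DelegatingHandler
{
    public int UnauthorizedCount; // how many 401 responses have been seen

    public UnauthorizedHandler(HttpMessageHandler inner) : base(inner) { }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken token)
    {
        var response = await base.SendAsync(request, token).ConfigureAwait(false);
        if (response.StatusCode == HttpStatusCode.Unauthorized) {
            // Placeholder: a real handler would attach credentials and resend.
            Interlocked.Increment(ref UnauthorizedCount);
        }
        return response;
    }
}

// Hypothetical inner layer: retries when the transport throws.
sealed class RetryHandler : DelegatingHandler
{
    private readonly int _maxAttempts; // assumed limit, chosen by the caller

    public RetryHandler(HttpMessageHandler inner, int maxAttempts) : base(inner)
    {
        _maxAttempts = maxAttempts;
    }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken token)
    {
        for (int attempt = 1; ; attempt++) {
            try {
                return await base.SendAsync(request, token).ConfigureAwait(false);
            } catch (HttpRequestException) {
                if (attempt >= _maxAttempts) throw; // out of attempts: give up
            }
            // Short linear backoff before the next attempt (assumed policy).
            await Task.Delay(TimeSpan.FromMilliseconds(100 * attempt), token)
                      .ConfigureAwait(false);
        }
    }
}

static class HandlerChain
{
    // Builds a client whose requests pass through both layers before the transport.
    public static HttpClient Create(HttpMessageHandler transport)
    {
        return new HttpClient(new UnauthorizedHandler(new RetryHandler(transport, 3)));
    }
}
```

With this shape, `HandlerChain.Create(new HttpClientHandler())` would give a client comparable to the library clients described above, while `new HttpClient()` corresponds to the unit test clients with no special setup. A retry of a request that carries a body may also need to recreate the content, which this sketch ignores.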

**PROBLEM**

HttpClient.SendAsync() never completes after a certain number of requests.  Drilling down, this is what I found.  The call to GetResponseAsync() that is eventually reached inside Mono never returns.  The request appears to get stuck and never goes out over the wire (according to Wireshark).  After many hours of examination, I tracked the hang down to the transition between managed and unmanaged code at socket_pool_queue.  Once the failing connection goes in there, it never returns.  A successful request should cause a callback to DispatcherCB in SocketAsyncWorker.cs, but that callback goes silent after the request in question.

Pausing the program in the Mono debugger while it is frozen reveals nothing in particular.  All thread pool threads are asleep waiting for work, and the two threads spawned by my program (the main thread and a database I/O thread) are both waiting.  The main thread is waiting for the replication sequence between the client and server to finish (by observing the event callback that fires every time the replication status changes), and the database I/O thread is waiting for new work (it is the consumer in a producer-consumer model).  I'm not sure if this is relevant, but the request that hangs is almost always a POST.

**ADDITIONAL INFO**

The unit test waits 60 seconds for the replication to finish before declaring it hung and unsuccessful.  However, more often than not, once that wait is over the callbacks for the socket operations start firing again.  Everything about this points to a problem in my code, but if I were deadlocking I should be able to see it in the managed stack traces.  I'd be happy to try running a debugger on the runtime itself, but my limited experience with that makes the task a bit daunting.
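To make the waiting pattern concrete, here is a minimal sketch of a main thread blocking on a status-change callback with a timeout. It is an illustration under assumptions, not the actual test code: FinishedObserver and its members are hypothetical names, and the real test's observer may be structured differently.

```csharp
using System;
using System.Threading;

// Hypothetical observer: signalled from a thread pool thread on each status change.
sealed class FinishedObserver
{
    private readonly ManualResetEventSlim _done = new ManualResetEventSlim(false);

    // Called on every replication status change; ignores changes while still running.
    public void Changed(bool stillRunning)
    {
        if (!stillRunning) {
            _done.Set();
        }
    }

    // Blocks the caller until the replication finishes or the timeout elapses.
    // Returns true if it finished in time, false if the wait timed out.
    public bool Wait(TimeSpan timeout)
    {
        return _done.Wait(timeout);
    }
}
```

In this shape the test thread would call `observer.Wait(TimeSpan.FromSeconds(60))` and treat a `false` result as a hang. If the socket callbacks never fire, `Changed(false)` is never called, so the main thread sits in `Wait` while every thread pool thread is idle, which matches what the debugger shows.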

**SOURCE**

The source for the unit test I described above is located at https://github.com/couchbase/couchbase-lite-net/blob/release/1.1.1/src/Couchbase.Lite.Tests.Shared/ReplicationTest.cs#L529.  To run this test, first clone the repo https://github.com/couchbase/couchbase-lite-net and open the Couchbase.Lite.Net45.sln solution in the src directory.  Create a local-test.properties file by copying the test.properties file.  Copy the executable I attached to this ticket into the Tools directory (it is for x64; if you need x86 let me know) and run the sync_gateway script found in the root directory.  After that, just run the TestPusher test using either MonoDevelop or nunit-console, and you will observe that the program hangs after some lines that look like this (on 4.0.3-20):

Replication: NotifyChangeListeners (0/0, state=Running (batch=0, net=1))
    Thread Name: Threadpool worker
    Date Time:   8/21/2015 9:55:56 AM
ReplicationObserver: Couchbase.Lite.ReplicationChangeEventArgs changed: 0 / 0
    Thread Name: Threadpool worker
    Date Time:   8/21/2015 9:55:56 AM
ReplicationObserver: ReplicationFinishedObserver.changed called, but replicator still running, so ignore it
    Thread Name: Threadpool worker
    Date Time:   8/21/2015 9:55:56 AM

The master branch makes it slightly further in the process but still exhibits the same symptoms.
Comment 1 Jim Borden 2015-08-20 21:38:50 UTC
Created attachment 12594
The sync gateway binary needed to run the unit test
Comment 2 Jim Borden 2015-08-27 21:36:58 UTC
For what it is worth, with the latest release of Mono in the apt-get repo as of August 27, 2015 (4.2.0.179), this problem doesn't happen as much (if at all) anymore.
