Bug 20360

Summary: FileStream Read on created Thread hangs the app after time
Product: Android Reporter: Rowan <rowan.harris>
Component: Mono runtime / AOT CompilerAssignee: Marek Habersack <grendel>
Status: RESOLVED FIXED    
Severity: normal CC: frederic.rudman, mono-bugs+monodroid, severo, tajinders, tj, udhams
Priority: High    
Version: 4.12.4   
Target Milestone: ---   
Hardware: PC   
OS: Windows   
Tags: Is this bug a regression?: ---
Last known good build:
Attachments: The app and data files that repeat the bug
android log of the issue happening
sample app that causes the issue faster than the previous app
source code for the test app and the resulting log showing the error
test apps to repeat my observations

Description Rowan 2014-06-04 19:54:53 UTC
Created attachment 6984 [details]
The app and data files that repeat the bug

If a FileStream is read over and over on a thread that is not the UI thread after about an hour the app hangs with no exception, just both the main UI thread and the extra thread doing the FileStream read are no longer running and if debugging and your press pause nothing happens.

If I move the FileStream read code to the main UI thread then it never hangs.

I have included the very simple app that repeats the issue (As there was no feedback in the log or from the app the only way I was able to debug this is was to start from my app of 60,000 lines of code and cut code out while the issue was still happening until I was left with 30 lines of code...) It has been a very painful week.

I also included the files that are being read in the folder Dados.rar

I have had this bug repeat on all 4 of my different android devices.
Comment 1 Rowan 2014-06-05 08:31:30 UTC
I found a work around for this bug, if I use FileInputStream it does not hang,
however their still is another similar hang in my app still. Is their any way
to get some diagnostics suggestions for finding the cause as I don't really
want to spend another week dividing the 60,000 lines down to 30 lines of code
again. I would say it is something similar to this bug as the result is exactly
the same, just freeze the app with no diagnostics information.
Comment 2 Udham Singh 2014-06-05 09:22:15 UTC
I have checked this issue at my end, but I am not able to reproduce this issue.
I have gone through following steps :

1. I downloaded the attached project and launch it on device.
2. I added the breakpoint in ThreadWork() method.
3. Then I debug the application, Then I press the paused. As expected, It
paused the application.
4. Then I let the application run for more than 1 hour and observed that It did
not hang.
5. Then again I followed the step 3.

Screencast : http://www.screencast.com/t/VLnWVJsKBD
Output Log : https://gist.github.com/saurabh360/34a092b1d40ed7d795d5

Please let me know if I have missed some steps to reproduce this issue.

Environment Info :

Microsoft Visual Studio Professional 2013
Version 12.0.30501.00 Update 2
Microsoft .NET Framework
Version 4.5.51641

Xamarin   3.0.54.0 (d3cf238e3845e930e312b6ec9b4c6c5437c33067)
Visual Studio extension to enable development for Xamarin.iOS and
Xamarin.Android

Xamarin.Android   4.12.4.20 (b5dc5ce91305e19de51d71a1122c109719c4bc34)
Visual Studio plugin to enable development for Xamarin.Android.

Xamarin.iOS   3.0.54.0 (d3cf238e3845e930e312b6ec9b4c6c5437c33067)
Visual Studio extension to enable development for Xamarin.iOS
Comment 3 Rowan 2014-06-05 19:25:24 UTC
In step 1 did you extract the folder from the Dados.rar file and place it on the SD card? So that the path matches "/sdcard/5500/Dados/Talhoes/"  The rar file only has the Talhoes folder inside so either create the 5500 and Dados folders yourself or modify the path in the app to remove those folders.

If the mount path of the sdcard on your device does not match the path I hard coded then please change it. Set a breakpoint and check that it is finding the files and reading the files.

Regarding the pause it only doesn't respond if the app has frozen. Up until then it will pause fine, just when the screen stops flashing blue and you try and pause the app to find the state of the program nothing happens.

Also the hour timeline is not always accurate, it can be 20 minutes and it can take 3 hours. But the average I estimate is 1 hour.
Comment 4 Rowan 2014-06-05 19:41:30 UTC
Sorry I just watched the screencast you did and you already did modify the path.

I will modify the app a little bit to output some data in the android log to help check that things are going as expected, and I will append my log file for you to compare against.
Comment 5 Rowan 2014-06-05 21:38:59 UTC
Ok to make it simpler you don't need the extra folder, I just create a file on the sd card in the app, then read that back.

Just replace the function ThreadWork with the new function, I have uploaded 2 logs of the issue, as you can see the log just stops with nothing unusual shown.

 protected void ThreadWork()
        {
            const string path = "/sdcard/temp.txt";
            var fSTmp = new FileStream(path, FileMode.OpenOrCreate, FileAccess.Write);
            fSTmp.Write(new byte[100000], 0, 100000);
            fSTmp.Close();
            long bytesRead = 0;
            int lastLog = System.Environment.TickCount;
            while (true)
            {
                if (System.Environment.TickCount > lastLog + 5000)
                {
                    lastLog = System.Environment.TickCount;
                    Android.Util.Log.Info("BytesRead", bytesRead.ToString());

                }

                var fS = new FileStream(path, FileMode.Open, FileAccess.Read);
                bytesRead += fS.Length;
                fS.Read(new byte[fS.Length], 0, (int)fS.Length);
                fS.Close();
                fS.Dispose();
            }
        }
Comment 6 Rowan 2014-06-05 21:41:33 UTC
Created attachment 6996 [details]
android log of the issue happening
Comment 7 Rowan 2014-06-10 20:23:57 UTC
Were you able to reproduce this?

Anything I can do to help let me know because the problem is very serious.

My environment info matches yours, (except that I use visual studio 2012, but I tested this and the same thing happens with xamarin studio.
Comment 8 Tajinder Singh 2014-06-11 09:27:16 UTC
Thanks Rowan

I have checked this issue with the code provided in comment 5 and I am still observing the same behavior as mentioned in comment 2.

Herewith I have attached output log:

For VS 2012 : https://gist.github.com/saurabh360/6516cbd0d844e942520e
For VS 2013 : https://gist.github.com/saurabh360/67febe0c7232e9e1ac6f

We may have to see if a developer can figure out what's going on from the output log.
Comment 9 Rowan 2014-06-16 04:54:36 UTC
Ok I have tried this with 2 additional machines compiling the code with the same environment as mentioned above. They both repeated this error, however it seems the first time the issue happens it can take a long time to happen. I noticed if the device was reset it would take anywhere from 4 hours to 56 hours for the problem to happen.

But if the problem has occurred and you don't reset the device and run the app again then it will happen much faster, usually around 1 hour.

I was never turning off my devices so I was always seeing it happen at the shorter time interval.

Unfortunately to repeat this you will need to run the app for an extended period of time for the first time.
Comment 10 Jonathan Pryor 2014-06-20 23:40:42 UTC
@Rowan: Do you still see the problem if you use System.Threading.Thread instead of Java.Lang.Thread? (Not sure why that would matter, but worth testing...)
Comment 11 Rowan 2014-06-21 00:49:23 UTC
yeah I already thought of that and gave it a try but had the same result unfortunately
Comment 12 Rowan 2014-06-22 03:53:41 UTC
There is a similar bug that I posted that is easier to repeat that shows similar behaviour so may be related to this one, it may be unrelated but worth taking a look. https://bugzilla.xamarin.com/show_bug.cgi?id=20794
Comment 13 Rowan 2014-06-29 00:38:25 UTC
I have modified the app to make this more repeatable, if I just create more threads to do the same task just have 4 threads doing this task at the same time it brings the freeze much quicker, anywhere from 5 minutes to 20 minutes. Again after it happens the first time it seems to happen much faster the second time, so don't reboot the device after the first time.

I have attached the modified sample app "FileStreamBug2.rar". I also added some extra debug strings output to the log.

I added a check for threads being paused (as I could see the screen stop updating briefly) On each thread if there is a delay of more than 100ms on any of the work threads or the render thread then it will output a message to the log.

From this I have seen that the render thread is the one that is being delayed periodically. Usually 130ms to 150ms delays are being written to the log. But none of the work threads are having delays. I experienced the same behaviour for 1 work thread or for 4 work threads, the render thread was still having these large delays for no reason.
Comment 14 Rowan 2014-06-29 00:41:28 UTC
Created attachment 7229 [details]
sample app that causes the issue faster than the previous app

just uses 4 work threads instead of 1.
Comment 15 Marek Habersack 2014-06-30 07:12:50 UTC
Rowan, a suggestion similar to made by Jon in comment 10 - have you tried using either a threadpool or System.Threading.Tasks? If you didn't, can you try if they help? I'm currently running the application waiting for the deadlock to happen. It's been running for 10 minutes so far without any problems.
Comment 16 Rowan 2014-06-30 08:26:08 UTC
I just tried both thread pool and task and I have had crashes with both approaches  within a few minutes.

What device are you using, is it quad core?

I have the issue happening on Galaxy S2, Galaxy Note 2, Nexus 5, a Freescale IMX6 custom device, rockchip hdmi device.

With varying android versions from 4.1.2 to 4.4.2

I have not had a device not repeat this issue.
Comment 17 Marek Habersack 2014-06-30 08:30:23 UTC
I was able to eventually reproduce it. I'm using Nexus 10 and Nexus 5, so yes, quad cores. Both of my nexuses are running android 4.4.4.
Comment 18 Marek Habersack 2014-07-01 09:40:59 UTC
Rowan, while I don't have a fix since I'm not even able to get a thread dump from the application running on device once it locks up (gdb locks up as well, go figure) I have managed to make the sample you supplied to run without locking up.

The change I made is related to the memory usage. Inside the while loop in ThreadWork you allocate a 100k byte buffer on each iteration. Creating a buffer variable and allocating the memory only when the buffer is null or too small made a world of difference. Before the change I got the lock up within 2-3 minutes on my Nexus 10, it's been running for 40 minutes now without a hitch. Looking at the log you provided in comment 6 I now lean towards explanation that the memory allocation puts too much pressure on the operating system memory and both the OS and dalvik end up deciding that the app is too aggressive with its memory usage and stop it. I don't have a confirmation for this, except for my modification which makes the dalvik GC messages disappear from the log completely. At the same time I'm running the test with Mono's GC logging enabled and I don't see anything out of ordinary in the output.

I don't know how closely the sample you provided reflects what's happening in your application but I'd suggest implementing a simple memory pool to provide buffers for the file reads. Also, I'd recommend not creating an instance of FileStream in each iteration (it's very expensive to set up) but reuse the stream as much as possible by seeking. If you keep reading a single file which is growing, it's easy to read the file state to obtain its size and read just the data that was added since last read. Alternatively, you can use FileSystemWatcher to observe the file(s) for changes and read only when a change is signalled.

Please let me know if memory pooling helps your application.
Comment 19 Rowan 2014-07-01 22:28:24 UTC
This case does not accurately reflect my app, my app is very complex, however this was one case I was able to isolate that the environment just stops responding. So I am not looking to stop this one case happening, I am looking to find the reason it stops happening. Switching to Java.IO.FileInputStream seems to reduce occurrence of this 1 case. But it is still happening.

The question is if too much memory allocation per second was the reason, then why would just switching to Java.IO.FileInputStream instead of FileStream stop this from happening? It is still allocating a similar amount of memory.

And why would the OS not kill the app completely, when the OS runs out of memory it kills the process completely and puts this in the android log. This is just freezing the app with nothing put in the log.

I just had a stack trace output for once on a crash!! It is not the exact same application, it is using thread pool. (And creating a new thread each time instead of a while loop. Which was still just crashing the same way, then I was testing Java.IO.FileInputStream which was crashing much less but on one occasion it sent out some info!! I have attached the file "spin on suspend.rar" that has the source code and the output.

You can see that the GC can not stop one of the threads for some reason. What could cause that to happen? 

This sounds very familiar to something else I have seen in my real app. There was a period of almost 2 second in my app where all threads where halted for no reason. The log did not say that the garbage collection took a long time, only 20 or 30ms was reported. But all threads were halted for almost 2 seconds. So it seems like the GC is trying to stop the threads so it can run but some thread is not halting as it should. It could be a separate issue of something else causing all threads to stop for 2 seconds but it does sound similar. Sometimes the thread does eventually halt.
Comment 20 Rowan 2014-07-01 22:30:12 UTC
Created attachment 7258 [details]
source code for the test app and the resulting log showing the error
Comment 21 Rowan 2014-07-02 23:37:33 UTC
Ok here is the next weird test.

I created a test app in Java (no openGl) to test your theory of memory allocation being the issue, the app is not crashing doing the same thread work. Which goes against that theory as it is running faster if anything and allocating/freeing more bytes per second.

So I did the same thing in mono, a basic activity app with no opentk, and it also is not crashing also.

So from my original opentk app I removed all my code for doing threads and commented commented out the onrender function. So the app was the minimum app for opentk, and in the log the GREFs were still increasing like crazy and the garbage collector was being called every 6 seconds! The GREF's in the log were always increasing like crazy.

So it is pointing to something wrong with opentk. The memory allocation just seems to bring on the issue, like opentk has corrupted the memory and as the app is always requesting new memory it gets passed memory it should not have.
Comment 22 Marek Habersack 2014-07-03 02:37:33 UTC
Rowan, thanks for the update! It's a good clue for the direction we can look in. On my part, I was digging in the Mono runtime and the GC and it seems that what's happening is some external trigger (I still think it's the OOM) which causes ALL of the process threads to enter sleep state (deep in the kerenel, in this case in sys_sigsuspend). I'm still trying to figure out what the external factor might be as well as keep looking for other causes in the runtime.
Comment 23 Rowan 2014-07-03 19:53:19 UTC
to see if it was opentk I made the same test but with native Android.Opengl
It also caused the problem (not as often taking longer for it to happen) but sometimes with different symptoms, sometimes it killed the device, turned it straight off...

The version with no opengl has been running on multiple devices and still has not crashed yet.
Comment 24 Rowan 2014-07-04 00:55:42 UTC
Created attachment 7275 [details]
test apps to repeat my observations

I have some more results, I created a java opengl app, identical to the mono native opengl app.

I increased the file size by 5 times to further test the memory theory.

The java app is yet to fail after 6 hours so far, the mono app has failed 3 times in that same period. The garbage collector is going crazy due to all the allocations but it is not failing in java and in mono it is failing and sometimes killing the OS also.

I have attached the apps if you want to look closer at my results and tests. (as well as the other apps from the comments above)

I also included the file to use for the tests so the same file is used each time. Just put that file on the /sdcard/ directory
Comment 25 Marek Habersack 2014-07-04 02:13:17 UTC
Rowan, thanks for the tests! It's definitely something with the GCs (either the interaction of them or one side is misbehaving) and I'm still in the process of looking for the cause.
Comment 26 Rowan 2014-07-04 02:26:55 UTC
ok great! 

One extra bit of info, I just had the non opengl mono app crash. (the one that is in the sample apps from comment 24)

I had added one extra piece of code to that app, when I tested it all yesterday and it did not crash their was no updating of the screen so I was not asking the main UI thread to do anything. Which in all the other apps the main UI thread is being called all the time to update the screen. So I added a thread that causes the activity to refresh the view every 30ms (the same rate as the opengl apps) And this caused the crash to happen.

So it is nothing to do with the opengl librarys themselves, but some interaction of the main ui thread.

Good luck with the hunt!
Comment 27 Marek Habersack 2014-07-04 02:32:37 UTC
Yeah, there's something happening that triggers some external response which, in effect, halts the process. The interesting thing is that the OS does *not* consider the app dead. The screen updates find is a good hint. There might be an awful lot going on behind the scenes when the screen is updated in opengl context. Currently, however, I'm really focused on finding where and how the process is suspended in the GC. Once I know that I can move to looking at the other areas :)
Comment 28 Marek Habersack 2014-07-10 06:59:17 UTC
This bug was fixed by https://github.com/mono/mono/commit/afa4cebee3d00dbb198880ffcc4cf0f8f237a23b

Note that the fix is *not* present in any Xamarin.Android release as of this moment.
Comment 30 Frederic Rudman 2014-08-20 10:20:28 UTC
We are experiencing something very close (but not exactly the same) as the problem listed above. The symptoms are described below but first, a quick question: it's now August 20th, 2014: has the fix above been released to Android/IDE yet or not. If it has, our problem is NOT solved. If not, I'm hoping maybe your fix will work for us. This is a serious/major/show-stopper problem for us.



[IMPORTANT: as per the last statement below: If it has been fixed but not yet released, is there a way for us to test your fix right away? Please let us know. Thanks.]



Our environment: An Android-based UDOO Quad (Android running on the Freescale i.MX6 side; Arduino running on an Atmel SAM3x mcu). We're trying to use the Arduino connected to Android as per Android's ADK/USB Accessory.

Steps to reproduce are consistent (sadly): we simply get the USBManager (as per Android docs/protocol), open the FileInputStream from the Accessory's FileDescriptor (a Java object) and start reading that fileInputStream. [on the Arduino side, we use USBHost/ADK, also as per protocol]

For the tests, we kept everything simple: we simply periodically write a few bytes on the Arduino side to be read on the Android side. Reading is done a Thread (.net) and upon reception of bytes we update the UI (via RunOnUIThread calls, obviously). All threading/reading is done on the .Net side, in a simple loop with NO dynamic object allocations: we use a once-allocated byte buffer over-and-over, 1024 bytes.

All is well for a while: we've had anywhere from 12,000 reads all the way up to 30,000 reads depending on how we varied the delay on the Arduino's side (a simple write functions that writes M bytes consistently to the USB Accessory every N milliseconds: we tested N with 50ms, 300ms, 330ms, 400ms, and 500ms, and matrix tested with M at 9 bytes, 14 bytes, and 26 bytes.

Although the failure rates were sometimes different we've ALWAYS failed in the fileInputStream.read after anywhere from 12,000 reads (a few minutes at 50ms) to 30,000 reads (a few hours at 500ms).

All these reads are done on a simple separate .Net THREAD (not Java) and it was never a hang (unlike the symtoms described by you above): we always end up with a caught Java.IO exception (caught from the .net try/catch side): most often (99%, but not 100%) it's a "No Such Device/ENODEV" error (as if the "device" disappeared suddenly" [at libcore.io.IoBridge.read (IoBridge.java: 435; FileInputStream.java: 179]). We can supply all the test code, if needed/wanted, to consistently reproduce this problem.

When the reads succeeded (up until the end) the bytes read were always correct (not a single error).

We've also tested it with regular and, separately, occasional writes (using FileOutputStream) in case we needed to perhaps "prime/refresh the pump/pipeline" both ways. No go.

We've also tested with regularly CLOSING, then REOPENING the file streams (both input and output) after a fixed number of messages (e.g. 1,000) as a "soft reset" feature. That definitely improved the number of loops (from our early 3K-5K to 12K-30K) but still failed in the end.

We suspected all along that it was a memory/GC problem but when we list available RAM (using Android's MemoryInfo) it shows plenty of memory and very little used (our code does NOT do dynamic allocations).

We need to be able to have our Arduino hardware communicate "continuously" over the USB Accessory port for MONTHS on end, not just a few minutes or hours. We can probably reduce the frequency of writes from 50ms to maybe 500ms [that would be slow but we could adjust for that] in case speed is an issue for the USB Accessory link, but it needs to be either:
1) consistent and never fail) 
...OR...
2) If it must fail, we need to be able to reset/restart it in software (we can/could [probably] afford a few seconds/minute of no comm while that link is being reset).

Does any of this seem familiar to you folks or should I file this as a separate bug?

Would there be any workaround in the meantime: this bug is a major show-stopper for us. The biggest problem is that when the communication between Arduino and Android dies, we have no way to restart it: Android cannot be software-tricked into Resetting the USB Accessory (unless you know something we don't?)

IMPORTANT: If this is likely the same problem you fixed but it has been fixed but not yet released, is there a way for us to test your fix right away? Please let us know. Thanks.
Comment 31 Frederic Rudman 2014-08-20 10:22:41 UTC
[as a hint in case it helps, this is the environment we use:

=== Xamarin Studio ===

Version 5.2.1 (build 1)
Installation UUID: 75a1b04b-0573-48dd-9033-bf08bc7f5047
Runtime:
	Microsoft .NET 4.0.30319.18444
	GTK+ 2.24.22 (MS-Windows theme)
	GTK# 2.12.25

=== Xamarin.Android ===

Version: 4.14.0 (Indie Edition)
Android SDK: C:\Users\DV18Admin\AppData\Local\Android\android-sdk
	Supported Android versions:
		2.1   (API level 7)
		2.2   (API level 8)
		2.3   (API level 10)
		3.1   (API level 12)
		4.0   (API level 14)
		4.0.3 (API level 15)
		4.2   (API level 17)
		4.3   (API level 18)
		4.4   (API level 19)
Java SDK: C:\Program Files (x86)\Java\jdk1.6.0_39
java version "1.6.0_39"
Java(TM) SE Runtime Environment (build 1.6.0_39-b04)
Java HotSpot(TM) Client VM (build 20.14-b01, mixed mode, sharing)

=== Build Information ===

Release ID: 502010001
Git revision: d06832ce9807d6be24aca225457e8b37c7669f6f
Build date: 2014-08-07 14:35:09-04
Xamarin addins: 1de032531be4cecf2f39dbee3b87aac78204058c

=== Operating System ===

Windows 6.1.7601.65536 (64-bit)