Bug 38161

Summary: instability/crash when using DateTime under cpu load
Product: Android Reporter: Rowan <rowan.harris>
Component: Mono runtime / AOT CompilerAssignee: Zoltan Varga <vargaz>
Status: VERIFIED FIXED    
Severity: normal CC: abhisheks, adrian.murphy, diegogadenz, grendel, guilherme, jonp, ludovic, mono-bugs+monodroid, otavio, peter.collins, rosado.czo, shrutis, vargaz
Priority: Normal    
Version: 6.0.1 (C6SR1)   
Target Milestone: C6SR4   
Hardware: PC   
OS: Windows   
Tags: Is this bug a regression?: ---
Last known good build:
Attachments: this is the project file for repeating the bug
logCat from application start, to the crash on the imx6
Asus Zenphone 2 (x64) Android 5.0 LogCat
emulator x64 android 6.0
An image showing a similar bug
galaxy note 3 logcat

Description Rowan 2016-01-29 03:37:45 UTC
Created attachment 14783 [details]
this is the project file for repeating the bug

In the sample application attached the following line throws an exception occasionally under high load.

new DateTime(0).AddMilliseconds(1);

the exception is System.ArgumentOutOfRangeException: Value to add was out of range.

This can happen when the system is under load, if not under load on multiple threads it does not throw an exception.

Other strange behavior can happen when using the DateTime when the app is under load. I have had a thread blocked and never continue again, but the cpu usage goes to 100% on one of the cpu cores however it is probably related to this bug and I have managed to get this bug repeatable so lets try and solve this problem first.

It is very repeatable on Android 4.4 and Android 4.1, just run the attached project and in less than 1 minute it should crash with that exception.

With Android 5.0 it does not seem to crash as often, and the behavior is a little different, it restarted the phone.
Comment 1 Rowan 2016-01-29 03:52:55 UTC
I should also add that time is very very important to us, our customers are seeing these freeze/crash issues so please any help would be greatly appreciated.

Just to explain the sample app to make it clear, it is just a new opengl app, with 1 class ThreadObj added to GlView.cs

There are 4 instances of ThreadObj created, so 4 threads end up running DoDates method which just allocates some memory and adds removes that memory from a list, then executes the new DateTime

 public class ThreadObj
        {
            private readonly Thread _thread;
            private readonly List<byte[]> list = new List<byte[]>();

            public ThreadObj()
            {
                _thread = new Thread(DoDates);
                _thread.Start();
            }

            private void DoDates()
            {
                while (true)
                {
                    while (list.Count < 100)
                    {
                        list.Add(new byte[100]);
                    }

                    while (list.Count > 0)
                    {
                        list.RemoveAt(0);
                    }

                    for (var i = 0; i < 100; i++)
                    {
                        new DateTime(0).AddMilliseconds(1);
                    }
                }
            }
        }
Comment 2 Jonathan Pryor 2016-01-29 14:27:41 UTC
> throws an exception occasionally under high load.

How "occasionally" is "occasionally"? How long do you usually need to wait?

I extracted out ThreadObj into a normal command-line app and a simple Main(), and mono 4.2.2 doesn't crash within ~1 minute of execution:

  class App {
    public static void Main ()
    {
      var threads = Enumerable.Range (0, Environment.ProcessorCount*16)
        .Select (i => new ThreadObj ())
        .ToArray ();
      Console.Write ("Press <ENTER> to quit");
      Console.ReadLine ();
      Environment.Exit (0);
    }
  }

Perhaps a minute isn't long enough? Perhaps desktop mono 4.2.2 isn't close enough to the 4.2-branch that Xamarin.Android is using? Perhaps it's an x86 vs. ARM issue? 

> it is just a new opengl app, with 1 class ThreadObj added to GlView.cs

Are you able to repro this *without* using an OpenGL app? Does the "normal" default app also exhibit this behavior for you? (Perhaps GL itself is a controlling factor?)
Comment 3 Rowan 2016-01-29 15:35:37 UTC
Thanks for the fast response! 

I tried running with a 4.4 emulator and it did not crash for me either. I could only get it to crash with a physical ARM device. The main device I am testing is a Freescale imx6. However I also have an old mediatech device that has the same response.

In this test case the crash happens on my device within 60 seconds. Most of the time around 10 seconds, but sometimes up to 60.
Comment 4 Rowan 2016-01-29 16:49:03 UTC
The same issue is also happening with Android 5.0 on the imx6
Comment 5 Rowan 2016-01-29 18:07:45 UTC
I tried a few more devices, an Asus Zenphone 2 (x64) Android 5.0 and it crashes every time after only a few seconds.

The Nexus 5/7 with Android 6.0 does not crash, the Galaxy Note 2 does not crash either after running for a few hours.
Comment 6 Marek Habersack 2016-01-30 11:26:32 UTC
Can you attach logcat output for the crash, including the managed stack trace and (if any) native stacktrace?
Comment 7 Rowan 2016-01-30 12:07:34 UTC
Created attachment 14811 [details]
logCat from application start, to the crash on the imx6
Comment 8 Rowan 2016-01-30 12:29:03 UTC
Created attachment 14812 [details]
Asus Zenphone 2 (x64) Android 5.0 LogCat

The Zenphone gives a different exception, it gives a null reference exception.
Comment 9 Rowan 2016-01-30 13:20:33 UTC
I played around some more and adding a sleep to the while loop stops the crash happening.

The higher the sleep the less likely the crash is to happen, on the Asus it requires about 10ms sleep, on the imx6 it requires only 3ms.

Also the for loop on the dates was not necessary, I am getting the same results for

for (var i = 0; i < 1; i++)
{
    new DateTime(0).AddMilliseconds(1);
}
Comment 10 Rowan 2016-01-30 16:28:48 UTC
Created attachment 14813 [details]
emulator x64 android 6.0

I increased the allocation size and the number of list items and I got a crash to happen in android 6.0 x64 emulator. The android log has a lot of detail which could help you. The crash took about 60 seconds or so.

Here is the class I was using, I added an alive message to the log just to check on the status of each thread.

public class ThreadObj
{
    private readonly Thread _thread;
    private readonly List<byte[]> list = new List<byte[]>();
	static int _cnt;
	int _idx;

    public ThreadObj()
    {
		_idx=_cnt++;
        _thread = new Thread(DoDates);
        _thread.Start();
    }

	int _time;
    private void DoDates()
    {
        while (true)
        {
            while (list.Count < 1000)
            {
                list.Add(new byte[10000]);
            }

            while (list.Count > 0)
            {
                list.RemoveAt(0);
            }

            new DateTime(0).AddMilliseconds(1);

			if (Environment.TickCount - _time > 500) 
			{
				_time = Environment.TickCount;
				Log.Info("alive", _idx.ToString());
			}
        }
    }
}
Comment 11 Marek Habersack 2016-01-30 21:58:07 UTC
(In reply to Rowan from comment #8)
> Created attachment 14812 [details]
> Asus Zenphone 2 (x64) Android 5.0 LogCat
> 
> The Zenphone gives a different exception, it gives a null reference
> exception.

I'd say this is an OOM. Note that the null reference exception is thrown from the DoDates method - which means that either the 'list' variable is null (unlikely) or DateTime instantiation fails and is somehow missed by the runtime (no idea how, just guessing).
Comment 12 Marek Habersack 2016-01-30 22:08:16 UTC
(In reply to Rowan from comment #7)
> Created attachment 14811 [details]
> logCat from application start, to the crash on the imx6

Now this one is more interesting. https://github.com/mono/referencesource/blob/mono/mscorlib/system/datetime.cs#L349 - this is the code that most likely throws the exception (I'm assuming 'millis' is neither smaller nor bigger than '-MaxMillis' and 'MaxMillis', respectively, since you pass a 1). In this case 'value == 1' and 'scale == 1', but somehow the calculation on the indicated line throws an OverflowException. This might be an instability in the FPU of the ARM CPU (I saw something like this happen, when occasionally the FPU would deviate from its usual spectrum of double value calculation results) or a bug in the Mono runtime.

Rodrigo, do you have any ideas about this one (I'm re-assigning the bug to you temporarily - if you decide it's not the runtime, please bounce it back to me)?
Comment 13 Rowan 2016-01-31 14:04:04 UTC
Created attachment 14819 [details]
An image showing a similar bug

Ok this is making some sense, in our real application I have been having a similar issue, some times the AddMilliseconds throws the exception, and other times the application gets stuck in an infinite loop. I captured this case yesterday while debugging. Look at the "Infinite loop.png"  A constant float that is not modified inside the function has a different and crazy value, how could that variable change? There was no FPU calculations stored to that variable, so to me it looks like an issue with the runtime.
Comment 14 Marek Habersack 2016-01-31 19:30:16 UTC
floating-point numbers are not precise beasts. It is possible (and accepted) that the value will not be precisely what you expect it to be. I *think* that the only type that guarantees the precision is decimal. float and double shouldn't be expected to be precisely what you put in them (they most often will be precise, but that's just *most of the time* really). The IEEE standard defines precise results of every FP operation, but the way the FP values are stored is really hardware-dependent and, also, language and compiler dependent. The net result is that what is stored in a double or float variable is only an approximation of the value you placed there. Now, if the value differs too much from what you placed in there then there's either a hardware error (faulty FPU, bug in the FP implementation, bad memory, even power current can be blamed in some situations since logic gate outputs do not use precisely defined voltage values to indicate operation status, but rather current ranges - unstable current can, albeit very rarely, influence the result) or some sort of software error (faulty FPU state save/restore, precision issues, memory corruption etc). The latter reason is why I asked Rodrigo to take a look, although I think it's more probable to put the "blame" (not really a blame, though) on the FPU/CPU in this case. The hardware being the cause is more likely when you consider that you weren't able to reproduce the crash in the emulator.

There's also the definition of FP in .NET which allows for deviations, please take a moment to read this article (I'm sure there are others that explain it as well): http://csharpindepth.com/Articles/General/FloatingPoint.aspx
Comment 15 Rowan 2016-01-31 20:03:12 UTC
I posted a log yesterday from the emulator in comment 10, it is not exact but a very weird case is happening in the emulator when I increase the memory being allocated.

Yes I am aware of precision of floats, but a constant float passed to a function that is not modified in that function should not go from 3.1415 to negative infinity, there was no FPU operation, just pass a value to a function and that value is not what it should be. And the fact that this issue is happening on at least 2 completely different hardware platforms (and other strange issues happening in real x64 devices and also in emulated android x64) mean it is seems unlikely to be a hardware issue.

So deviation is fine, but the issue does not look like deviation, it looks like the number has been completely corrupted, like the memory has become corrupted.
Comment 16 Rowan 2016-01-31 20:36:04 UTC
ok to prove the point I removed the DateTime and put that function from the image.

I get exactly the same crazy value for max every time.

Please take a close look at the code posted below. I have the function SetBounds, and I pass it a value that no precision errors could effect as 0 is easily less than 3.1415

And from the logCat I print the value of max when it fails

01-31 17:29:44.472: E/mono(10334): System.Exception: max is corrupt, its value is: -5.919196E+21

It is always this value, so something has gone very wrong.

public class ThreadObj
{
    private readonly Thread _thread;
    private readonly List<byte[]> list = new List<byte[]>();

    public ThreadObj()
    {
        _thread = new Thread(DoDates);
        _thread.Start();
    }

    private void DoDates()
    {
        while (true)
        {
            while (list.Count < 100)
            {
                list.Add(new byte[100]);
            }

            while (list.Count > 0)
            {
                list.RemoveAt(0);
            }

            var val = 0f;
            for (var i = 0; i < 100; i++)
                SetBounds(ref val, -Pi, Pi);
        }
    }

    public const float Pi = (float)Math.PI;
    public const float TwoPi = (float)(Math.PI * 2f);

    public static void SetBounds(ref float value, float min, float max)
    {
        if (value > max)
        {
            throw new Exception("max is corrupt, its value is: " + max);
        }
    }
}
Comment 17 Marek Habersack 2016-01-31 21:38:55 UTC
Rowan, I'm not saying you're wrong, not at all. It might be memory corruption or it might be the hardware - I don't know yet, let's wait for input from Rodrigo.
Comment 18 Rowan 2016-02-01 00:47:18 UTC
thanks for your help, I am just trying to provide as much evidence as I can to help resolve this as soon as possible.

I think maybe this is all related the the garbage collector. You are right that on the ASUS it is probably an OOM. However the garbage collector is never called. I did some more tests and the more free memory I have on the device the longer it takes to crash. I added a counter to count how much memory is allocated and it pretty much is the free space before the app was started. So the question is why is the garbage collector not being called. It was 7 seconds from when the first allocation was done until the app crashed Maybe my threads running all the time are preventing the garbage collector from running. But they are only normal priority so not sure why that would be the case. So it was a large period of time for the GC not to kick in.

Maybe on the imx6 and the rockchip 3066 they are also out of memory exceptions but the older version of android 4.4 and 4.1 the memory is still being allocated but since the memory has run out it is allocating memory that is already used.

I am going to roll back to some older Xamarin releases and see if the problem has been introduced recently.

Thanks again for your help
Comment 19 Marek Habersack 2016-02-01 09:23:17 UTC
It does indeed look like a memory management/corruption issue, possibly the GC plays a role too (although I don't think it can be blamed for this particular instance of memory corruption). We really need Rodrigo + team to chime in :)
Comment 20 Rodrigo Kumpera 2016-02-01 12:49:36 UTC
Hey Ludo,

This could be either one of the last two issues you troubleshooted, right?
Comment 22 Rowan 2016-02-03 14:41:13 UTC
I am excited to try it! but that link was to a .pkg? I am using windows, is a .pkg for mac? How do I install that on windows?
Comment 23 Rowan 2016-02-04 17:33:27 UTC
the suspense is killing me...can someone please make that build available for windows?
Comment 24 Ludovic Henry 2016-02-04 19:02:30 UTC
Hello Rowan, I am working on getting a windows build ASAP. I will come back to you as soon as I have one. Sorry for the delay.
Comment 25 Ludovic Henry 2016-02-05 14:12:02 UTC
Hello Rowan, please try with the windows build available here : https://code.internalx.com/Wrench/BuildDetails?buildId=76cae85a42a865a0ced1919c3edf
Comment 26 Rowan 2016-02-05 14:31:39 UTC
Hi Ludovic, I have tried that version out but unfortunately it gave the exact same result as before.

the version info after the install
Xamarin   4.0.1.90 (3d3c042)
Xamarin.Android   6.0.1.8 (d3d044d)
Comment 27 Rowan 2016-02-08 20:33:31 UTC
Guys anything I can do? I am available 24x7 to assist in any way possible. This bug is critical for us and we really need this solved as soon as possible. Please let me know if there is anything I can do to help.
Comment 28 Rodrigo Kumpera 2016-02-08 20:47:31 UTC
Hi Rowan,

Sorry for not updating the bug more frequently, but we're actively investigating it.

Please understand that this is no ordinary bug, it takes quite some time to root out the issue.

As soon as we have a candidate fix, we'll update this bug, with a build so you can validate it.
Comment 29 Rowan 2016-02-08 21:07:50 UTC
Thanks for the update Rodrigo, glad to hear to hear it is being looked into, let me know if you find a solution or need any other tests or information.
Comment 30 Rowan 2016-02-18 14:10:53 UTC
Hi Rodrigo, I was wondering if you have any news on this yet? Do you have any rough estimations yet of the timeline this will take to resolve?
Comment 31 Ludovic Henry 2016-02-22 11:44:27 UTC
Hi Rowan, I will dedicate my week to this bug, so hopefully I should get something by the end of it.
Comment 32 Ludovic Henry 2016-02-22 16:02:35 UTC
Rowan,

I have been trying to reproduce the issue with the project attached (https://bugzilla.xamarin.com/attachment.cgi?id=14783), and the code at https://bugzilla.xamarin.com/show_bug.cgi?id=38161#c16. But I was unable to reproduce the crash with : a Nexus 5 (Android 6.0.1), a Nexus 5 armv7 emulator (Android 4.4.2 and Android 6.0.1) and a Nexus 5 x86 emulator (Android 4.4.2 and Android 6.0.1), and these with Xamarin.Android 6.0.1.7, 6.0.1.9 and 6.0.1.10.

Could you provide more informations regarding the different version/devices/emulators you are using to reproduce this crash?

Thank you!
Comment 33 Rowan 2016-02-22 17:22:20 UTC
Hi Ludovic,

Our company makes a device using the Freescale/NXP imx6 processor with custom hardware. So all the issues our customers are having are happening on this processor. 

However in my tests with other consumer consumer devices and it was hit and miss which ones repeated the problem.

It does not seem to be dependent on the android OS version, more on the hardware. As I tried multiple OS versions with the imx6 and it crashed the same on each.

I have a dual core RockChip 3066 hdmi stick that has exactly the same behavior as the imx6 does but again this is an old device so very unlikely you have one.

With the Asus Zenphone 2 it happens very fast. However this is an intel x64 chip so it is a bit of a deviation from our target hardware so I am sure this is related to what we are seeing on the arm hardware, however the crash is not exactly the same, on the Asus it actually crashes the app, on the arm it returns a different float value. 

The nexus 5 and nexus 7 do not cause the issue as I have tested with them and they worked fine.

On the Samsung Galaxy Note 2 it seems to work fine for hours on end, but on occasion the phone has rebooted.

I have only managed to get the emulator to crash once with the log that I posted above which again was a different crash message, I will try and fine tune the application with different memory allocation sizes to see if I can get it more repeatable with the emulator.

What other devices do you have access to? Basically try what ever you have but arm v7 that is a few years old seems to cause the problem, as well as the Zenphone 2 for newer hardware.

It is a pain that it is not repeatable on all hardware but lets hope you have a device that this repeats on.
Comment 34 Rowan 2016-02-22 17:56:45 UTC
Testing on some phones around the office these 2 also crashed in less than 10 seconds.

Zenphone 5
Samsung Galaxy Win Duos
Comment 35 Rowan 2016-02-22 18:00:52 UTC
Samsung Galaxy Win2 (model: SM-G360M) also fails in less than 5 seconds
Comment 36 Rowan 2016-02-22 18:02:17 UTC
Galaxy note 3 (SM-N7502) less than 5 seconds
Comment 37 Rowan 2016-02-22 18:18:38 UTC
Created attachment 15135 [details]
galaxy note 3 logcat

This was a log from the Note 3 from booting, then just running the app.

search for "Unhandled Exception", this is exactly the same result we get on the imx6
Comment 38 Rowan 2016-03-01 18:42:02 UTC
Hi Ludovic, Do you have an update on this? Have you tried to repeat this? Were you successful?

Any update on the progress would be much appreciated. If you don't have many phones to try this on I can buy a galaxy note 3 and one have it shipped to your location.

Looking forward hearing about the progress, thanks again.
Comment 39 Ludovic Henry 2016-03-01 23:52:18 UTC
Hi Rowan,

The device is on its way, so as soon as I have it, I will keep you updated.

Thank you,
Ludovic
Comment 40 Ludovic Henry 2016-03-09 12:06:59 UTC
Hi Rowan,

Good news, I got the device and could reproduce the bug. I really think it's not going to be an easy one to investigate, so it will take a few days.

Thank you,
Ludovic
Comment 41 Rowan 2016-03-09 15:23:21 UTC
great news! thanks for the update, good luck
Comment 42 Ludovic Henry 2016-03-15 17:25:20 UTC
Quick update: I could reduce the test case to https://gist.github.com/ludovic-henry/c708d24973bb903a616c. The issue seems to be related to suspending and signal handler redzones. I will keep digging into that.
Comment 43 Rowan 2016-03-15 19:33:44 UTC
thanks for the update
Comment 44 Zoltan Varga 2016-03-17 10:18:39 UTC
Should be fixed by mono 59887e3736a40dd6bda145954af0ee241bb48185. Thanks for the report.

This should be backported to a service release.
Comment 45 Ludovic Henry 2016-03-17 14:17:56 UTC
I confirm that the bug was fixed with Zoltan fix: I could run for a few hours without a crash.
Comment 46 Rowan 2016-03-17 14:58:23 UTC
Nice work guys! you have me very very excited!

Do you have a build yet that you can release externally so I could test it also?
Comment 49 abhi 2016-05-05 12:26:31 UTC
Reproduce Status
*************************
I checked this issue with Xamarin.VisualStudio_4.0.1.73_3e7e678d46c59b2a4f7bd9310eb2b51f0a5bf100 and able to reproduce this issue.

ScreenCast:http://www.screencast.com/t/OyBCRYrv

*************************
Verified Status
*************************

I checked this issue with latest C6SR4 Xamarin.VisualStudio_4.0.4.0_6d27f27d5c4512395ea73b1d6644439c5445c06c 
and it is working fine.

Environment Info : https://gist.github.com/abhisheks360/3ea6fbde1b7972fc4486a65ed6cf56a2

Hence, closing this issue.