Bug 42606

Summary: [Regression] Compact seq point enabled by default causes 100% CPU usage
Product: [Mono] Runtime Reporter: Łukasz Domeradzki <archipl33>
Component: JITAssignee: Bernhard Urban <bernhard.urban>
Severity: normal CC: henrik, kumpera, luis.aguilera, masafa, mono-bugs+mono, mono-bugs+runtime, narayanp, sachins, xamarin
Priority: High    
Version: unspecified   
Target Milestone: 4.6.0 (C8)   
Hardware: PC   
OS: Linux   
Tags: C8Beta1 Is this bug a regression?: Yes
Last known good build:
Attachments: Reproducable test case

Description Łukasz Domeradzki 2016-07-15 20:08:37 UTC
In commit https://github.com/mono/mono/commit/209c4246adbba994dfcecbc1e93481016b8d9f44 seq points were enabled by default, which resulted in a regression for my specific setup.

The bug is rather weird, as it looks like Mono runtime can stop executing the program at random point by running into 100% CPU usage (infinite loop?) which causes the runtime to be stuck, without any crash or segfault, even after several hours.

The bug is not reproducable with MONO_DEBUG=no-compact-seq-points option set, which disables seq points once again (reverting to old behaviour).

I reproduced the issue not only in latest master, but also in older Mono versions (4.2 tested) with MONO_DEBUG=gen-compact-seq-points enabled, so this bug wasn't introduced in above commit, but much much earlier, perhaps it was even incompatible from the start but I just ran into it with my specific setup.

It's also not that easy to reproduce, as it looks like smaller projects are working 100% correctly, while the bug is hitting only bigger, more complex projects with more classes - I reproduced it by running ILRepack as well as running my own self-compiled xbuild'ed project, but I couldn't reproduce it with my own self-compiled two smaller projects, which seem to be working perfectly fine even in faulty setup.

I suspect this is the result of gcc-specific flag such as stripping debug symbols from output binary, that I'm currently trying to find out. I'll update this issue once I gather more relevant information which could help locate the culprit.

Comment 1 Łukasz Domeradzki 2016-07-15 20:46:23 UTC
Created attachment 16692 [details]
Reproducable test case

I'm sorry for the format but as noted in the bug, this issue is specific and doesn't happen with every project, so it's easier for me to include something bigger right away in order to have reliable test case.
Comment 2 Łukasz Domeradzki 2016-07-15 20:49:55 UTC
I was close - it wasn't gcc optimization flag, but in fact Mono one.

I attached reproducable test case that should be working on both new (master) and old (4.2/4.4) Mono versions, just remove unnecessary export if you're running master, as this is default now.

The culprit is combination of abcrem and tailc optimizations - using just one doesn't trigger it, but using both does. Likewise, -O=all also triggers the issue, but default run (without -O) seems to work fine, probably that's why it wasn't caught earlier.

Once again I'm sorry for "format" of the test case, but that was the best I could come up with. If it isn't satisfying, you can try with any bigger C# project, but the bug doesn't trigger if binary doesn't meet some "requirements" - it looks like size issue to me, but maybe it's something more specific, I don't know.

Hope it helps, I did everything I could to pinpoint the issue, rest is upon you. Good luck and thank you in advance!
Comment 3 Zoltan Varga 2016-07-16 08:51:30 UTC
I can reproduce this on osx with mono master.
Comment 4 Rodrigo Kumpera 2016-07-20 01:17:00 UTC
Hey Marcos,

Can you take a look at this one.
Comment 5 marcos.henrich 2016-07-20 17:21:17 UTC
Hi Lukasz, thanks for the detailed bug report.

Compact sequence points are somewhat problematic because we emit a bunch of temporary sequence points opcodes that break optimizations that are not expecting them on the middle of others opcodes.

Our test suite verifies that the code generated with and without the sequence point opcode has the same size for each method. Running that test suite with -O=all showed no errors. Which means one of two things, either it does not cover the failing optimization, or the partially optimized code compiled with sequence point opcodes has the same size has the correct one.

I will run a script to check for diferences on the JIT compiled assembly between running your application with and without sequence points opcodes, hopefully some methods will diverge in size and we will be able to track the failing optimizations from there.
Comment 6 Łukasz Domeradzki 2016-07-20 19:13:25 UTC
Hi, and thank you for your response Marcos.

Yes, as I said, this bug is quite hard to reproduce, and I'm not entirely sure why, that's why I attached binary and reproduce script instead of usual testcase in source form. Feel free to modify included reproduce.sh script in order to find differences in the JIT assembly that could help you spot the real culprit, because as I noted, this issue happens only in this specific case of turning on those two Mono optimizations, and not in any other case. I'd be happy to further help, but unfortunately I don't have enough knowledge to even try to guess what might be wrong, as I'm not that experienced in this field - I'd be happy to fix the issue and send PR otherwise.

Regardless, I'm happy that you were able to reproduce the bug, as that's most important. It's not ILRepack-specific bug, as I was able to reproduce it also in other projects, I just considered to attach this one as it was easiest to use for this purpose. You can remove -O parameter to check that Mono is running fine without those two optimizations.

Thank you in advance for taking look into this issue.
Comment 7 Rodrigo Kumpera 2016-08-18 23:13:32 UTC
Hey Marcos,

Did you manage to reproduce the issue?
Comment 8 marcos.henrich 2016-08-22 15:22:27 UTC
Hey Rodrigo,

The issue can be reproduced as Łukasz suggested.

Alexander Köplinger also discovered a good way of reproducing it, which requires only mcs. By doing:

MONO_PATH=mcs/class/lib/net_4_x/ mono/mini/mono -O=all mcs/class/lib/build/mcs.exe Program.cs

class Program
    public static void Main(string[] args)

It only hangs when we use the param string[] in the Main method.

I changed our test suite to compare all the generated native opcodes, but we also have noise of a few differences caused by non determinism in the JIT compiler.

The pull request is here: https://github.com/mono/mono/pull/3425

I check manually with lldb and monobt the mcs test case and Mono.CSharp.VarianceDecl:CheckTypeVariance is calling Mono.CSharp.TypeSpec:get_TypeArguments in a infinite loop.
A diff on the jitted code with and without OP_IL_SEQ_POINT revealed differences:

The issue seems to be in:
000000000000042f  movq  %rax, %r13 | 000000000000042f  xorl  %eax, %eax

@Rodrigo It is quite hard to me to find the JIT compiler logic responsible by that difference. Can someone from the runtime team give me some insight on this?
Comment 9 marcos.henrich 2016-08-22 15:24:34 UTC
More detailed gist:
Comment 10 marcos.henrich 2016-08-22 15:36:48 UTC
I meant the change before the jmp:
0000000000000426 movq (%rbx,%rax), %r13  | 0000000000000426 movq (%rbx,%rax), %rax
Comment 11 Rodrigo Kumpera 2016-08-26 20:04:49 UTC
Great investigative job Marcos! That looks like an incompatible change.
Comment 12 Bernhard Urban 2016-08-31 10:54:55 UTC
in progress: https://github.com/mono/mono/pull/3492
Comment 13 Bernhard Urban 2016-09-01 05:25:46 UTC
backport: https://github.com/mono/mono/pull/3502
Comment 14 Bernhard Urban 2016-09-06 00:38:34 UTC
*** Bug 43004 has been marked as a duplicate of this bug. ***
Comment 17 Łukasz Domeradzki 2016-09-12 16:43:42 UTC
Yes, I can confirm that this issue has been fixed in https://github.com/mono/mono/pull/3492. Big thanks to everybody looking into the problem!