Bug 41995

Summary: [Concurrent SGen] Assertion: should not be reached at sgen-marksweep.c:1103
Product: [Mono] Runtime
Reporter: Łukasz Domeradzki <archipl33>
Component: GC
Assignee: Vlad Brezae <vlad.brezae>
Status: RESOLVED FIXED
Severity: normal
CC: kumpera, mono-bugs+mono, mono-bugs+runtime
Priority: ---
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Linux
Tags:
Is this bug a regression?: ---
Last known good build:
Attachments: Reproducible solution

Description Łukasz Domeradzki 2016-06-20 12:59:37 UTC
Latest Mono from master compiled with --with-concurrent-gc=yes causes an assertion at https://github.com/mono/mono/blob/master/mono/sgen/sgen-marksweep.c#L1103.

At the time of writing, it was the only assertion in the pin_major_object() function.

The symptoms are similar to a memory leak - the Mono runtime takes an excessive amount of memory on every allocation and never gives it back. It's not the size of the heap, as I limited max-heap-size to 256m, yet at the time of the crash Mono was using 38.9% of my memory (around 680 MB of actual usage), while all assemblies calculated together total less than 5 MB.

I don't have a reproducible test case ATM, as the issue is rather semi-rare and requires some time to progress. However, I think that running several async tasks on the threadpool every few seconds, each allocating a fixed amount of memory, should lead to such a crash - at least that's what is happening in my case (a rough sketch of that pattern is below).
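
Just to illustrate what I mean, a minimal hypothetical sketch of that load pattern (this is not my actual bot code, purely an illustration):

    // Fire a batch of threadpool tasks every few seconds, each allocating a fixed chunk
    // of memory that becomes garbage as soon as the task finishes.
    using System;
    using System.Threading.Tasks;

    static class LeakRepro
    {
        static void Main()
        {
            RunAsync().Wait();
        }

        static async Task RunAsync()
        {
            while (true)
            {
                var tasks = new Task[8];
                for (int i = 0; i < tasks.Length; i++)
                {
                    tasks[i] = Task.Run(() =>
                    {
                        var buffer = new byte[1024 * 1024]; // fixed-size allocation
                        buffer[0] = 1;
                    });
                }
                await Task.WhenAll(tasks);
                await Task.Delay(TimeSpan.FromSeconds(5));
            }
        }
    }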

Regards,
Łukasz
Comment 1 Łukasz Domeradzki 2016-06-20 13:01:51 UTC
Sorry, of course I meant --with-sgen-default-concurrent=yes and not --with-concurrent-gc=yes. My memory is faulty :).
Comment 2 Rodrigo Kumpera 2016-09-14 01:24:38 UTC
We recently merged some changes that addressed some of those issues.

If they still occur with current master, please reopen this bug.
Comment 3 Łukasz Domeradzki 2016-09-14 01:28:31 UTC
Thank you for the answer, Rodrigo. If that's the case, I'll recompile my Mono with concurrent SGen once again and test whether I spot any anomalies related to that issue (or maybe some other one).

Big kudos to you and the whole team!
Comment 4 Vlad Brezae 2016-09-14 11:02:29 UTC
Given that these are, in the end, two different collectors, there can be some memory usage differences. The concurrent collector also tends to keep slightly more objects alive, collecting them later on. This could translate to a few percent of increased memory consumption.

Having said that, we currently don't have great recovery from out-of-memory situations; this is the case with both collectors, and especially with the concurrent one, and it is something that I'm looking into right now. I'm reopening this since we still have some assertions and special cases that we need to account for.

In order to help us see whether there is indeed a memory leak in sgen: if you can still get a repro of this, given the latest changes on master, you could try adding the --stats argument to mono, which prints some stats at the end of the run (the valloc stats are especially useful). In order to print the stats at crash time, you can pass MONO_DEBUG=suspend-on-sigsegv, then attach to the suspended mono process and, from the debugger, invoke mono_counters_dump (0xffff, stdout).
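
Roughly, that workflow looks like this (the app name and pid are placeholders; any debugger that can call functions in the target process works, gdb shown here):

    # run with runtime stats enabled and suspend instead of crashing on SIGSEGV
    MONO_DEBUG=suspend-on-sigsegv mono --stats YourApp.exe

    # from another terminal, once memory usage is unusually high
    # (or mono has suspended itself), attach and dump the runtime counters
    gdb -p <mono-pid>
    (gdb) call mono_counters_dump (0xffff, stdout)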
Comment 5 Łukasz Domeradzki 2016-09-14 15:25:03 UTC
Sadly the bug is still here, and it's not just "slightly more objects alive" - I woke up to see this: http://i.imgur.com/G3bgUiZ.png

My bot usually takes 8-12% memory usage after a few days of running; today the same code, run with concurrent SGen, reached 40% memory after only ~9 hours of work. I use very restrictive MONO_GC_PARAMS for it in order to minimize memory usage, mainly soft-heap-limit set to 256 MB, which is sufficient - in this case the runtime nearly quadrupled that amount, so something is definitely not right.

I'll follow your tips and maybe try to write a reproducible case, but there is definitely something fishy going on with memory not being fully released - the more memory is used, the more leakage I observe, and it's literally ballooning for no reason. If I allowed my bot to run longer, the kernel would soon kill it, as it had already filled 50% of swap. Such a thing never happened without concurrent SGen.

Just to clarify: I didn't observe the crash anymore, just the memory leak this time. Perhaps the crash would have come later, but I couldn't wait for the process to run OOM. I'll test it on another machine and hopefully submit more useful information in the next comment.
Comment 6 Vlad Brezae 2016-09-14 15:34:50 UTC
Most of the memory used by sgen should come from the heap, so if the heap size is limited and you see that much memory usage, it sounds like there's a leak indeed. Either run latest master with --stats and attach/dump the counters when the memory usage is high, or a repro is also useful. It doesn't really matter whether it crashes or not - as soon as the memory usage is unusually high, that is the triggering problem in the first place.
Comment 7 Łukasz Domeradzki 2016-09-15 03:25:19 UTC
Created attachment 17497 [details]
Reproducible solution
Comment 8 Łukasz Domeradzki 2016-09-15 03:35:23 UTC
I managed to reproduce this bug in an at least semi-reliable way.

The same code results in two totally different memory usages:

Without conc SGen: http://i.imgur.com/GuJn2vs.png (--with-sgen-default-concurrent=no)
With conc SGen: http://i.imgur.com/finpeZ0.png (--with-sgen-default-concurrent=yes)

As you can see, both screenshots were taken after around ~2 minutes of CPU time for each app, to keep the comparison between the two GC implementations as fair as possible. Conc SGen took more than 8x the memory after 2 minutes of running on my machine, and it keeps increasing - if you leave it running for a longer while, it'll probably crash or run into OOM. Non-conc SGen keeps a very static memory usage of 0.9-1.1.

In both cases I used very restrictive MONO_GC_PARAMS, which you can find in reproduce.sh.

Notice: don't trust GC.GetTotalMemory() - it lies; trust htop :).

I don't know yet what the root of this issue is; I just wrote some quick code to confirm my initial guess about some threadpool/tasks leaks. Non-conc SGen works excellently in this case, while conc SGen increases its memory usage with every second.

I hope you'll be able to reproduce it on your machine too. I'll try to find some more clues, maybe in the Mono --stats output you suggested, and post extra findings (if any) tomorrow, as it's super late for me :). In any case, let me know if you're able to reproduce it, as it might also be connected with my specific setup and compilation flags, but I doubt it, considering that just turning off concurrent SGen fixes the issue for me completely.

Thanks in advance for looking into this issue!
Comment 9 Vlad Brezae 2016-09-15 14:48:57 UTC
Thank you for the repro. I've been able to reproduce the issue there; a fix is underway (https://github.com/mono/mono/pull/3567). The problem was that we were leaking GC-internal memory on every concurrent major collection (GC.GetTotalMemory() only reports managed memory). Let me know if this fixes the leak in your app.

In order to leak a few hundred MB, the test had to trigger about 100000 major collections. As a side note, if this is the same leak that happened in your app, it might mean that your app is triggering an unusually high number of majors, so performance will suffer greatly there, in case that is a concern.

Also, you can enable the concurrent collector with MONO_GC_PARAMS=major=marksweep-conc (instead of doing full rebuilds with that configure flag all the time).
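
For example, something like this selects the concurrent major collector at runtime while keeping the kind of heap limit you mentioned earlier (the executable name is just a placeholder):

    MONO_GC_PARAMS=major=marksweep-conc,soft-heap-limit=256m mono YourApp.exe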
Comment 10 Łukasz Domeradzki 2016-09-15 15:19:42 UTC
I'm glad you found the issue. Yeah, the number of major collections in my case is closely tied to the very aggressive MONO_GC_PARAMS I'm using, as I have only 2 GB of dedicated memory on my server and I must keep usage low to avoid swapping (which is even slower than a major collection). I also work with a quite big amount of data concurrently (every 15 minutes), so it's totally possible that after ~9 hours it could result in such stats - I managed to do that on one thread in only ~5 minutes, while my real app works heavily in a concurrent way (and it had 42 minutes of CPU time when I took the initial screenshot).

In any case, thanks for the tip, and double thanks for solving the issue. I'm very happy that I could help locate the culprit, and I'll be even happier to test whether your fix solves my initial problem after the PR gets merged. I appreciate the help very much!
Comment 11 Vlad Brezae 2016-09-19 18:15:07 UTC
Fixed on master. We will soon do some additional fixes to handle the behavior and some asserts in OOM scenarios, but you shouldn't reach those cases anymore, since we are no longer leaking memory that would lead there. Marking as fixed.

Let me know if you encounter any problems.
Comment 12 Łukasz Domeradzki 2016-09-20 12:25:55 UTC
I can confirm that I'm no longer seeing any issues related to the memory leak with the above PR merged, both in the reproducible case as well as in my main project. Thank you, Vlad, for fixing the culprit. I'm currently testing how concurrent SGen behaves now that this issue has been fixed, so I'll be sure to let you know if I spot any anomalies, although at first glance everything looks as it should now :).

Thanks once again!