Bug 57313 - Xamarin Android Weak Reference Table Overflow
Summary: Xamarin Android Weak Reference Table Overflow
Status: REOPENED
Alias: None
Product: Android
Classification: Xamarin
Component: Mono runtime / AOT Compiler (show other bugs)
Version: 7.3 (15.2)
Hardware: PC Windows
: --- normal
Target Milestone: ---
Assignee: Jonathan Pryor
URL:
Depends on:
Blocks:
 
Reported: 2017-06-09 12:12 UTC by Callan Hodgskin
Modified: 2017-08-01 01:26 UTC (History)
3 users (show)

See Also:
Tags:
Is this bug a regression?: ---
Last known good build:


Attachments
ADB Crash Log 02 (118.42 KB, text/plain)
2017-06-27 00:42 UTC, Callan Hodgskin
Details
AD Crash Log 01 (64.22 KB, text/plain)
2017-06-27 00:44 UTC, Callan Hodgskin
Details
Pathos Logs 03 (No Crash) (5.60 MB, application/x-7z-compressed)
2017-06-28 12:30 UTC, Callan Hodgskin
Details

Description Callan Hodgskin 2017-06-09 12:12:46 UTC
I have reports from a number of users that my app is crashing after ~5 minutes of intensive use. I have received crash logs on Google Play and an example is attached below. The message seems to be:

    JNI ERROR (app bug): weak global reference table overflow (max=51200)'

The issue has occurred on several different devices including the Nexus 5X, Nexus 6P, Galaxy S7 and Pixel XL. 
 I'm not familiar with the JNI and would appreciate so any advice/explanations/suggestions on how to figure this out.

The relevant code can be found in my open source project: https://gitlab.com/hodgskin-callan/Invention. However, I don't have the minimum code to reproduce the issue and it does not reproduce on my Nexus 9. I suspect this issue is not affecting the majority of my Android users.

I have integrated HockeyApp but this type of crash cannot be tracked. According to the HockeyApp support: "What you are seeing here is a native (NDK) crash. Our SDK doesn't catch those as it covers Xamarin Exceptions and Java exceptions only."

Related links for this issue:
* https://stackoverflow.com/questions/43403968/how-do-i-diagnose-the-cause-of-a-xamarin-android-weak-reference-table-overflow
* https://support.hockeyapp.net/discussions/problems/74432-are-there-types-of-crashes-that-are-not-captured-for-xamarin-android?unresolve=true


The issue as reported in Google Play Crashes:

*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
Build fingerprint: 'google/bullhead/bullhead:7.1.1/N4F26T/3687331:user/release-keys'
Revision: 'rev_1.0'
ABI: 'arm'
pid: 10404, tid: 10404, name: .x10host.pathos  >>> com.x10host.pathos <<<
signal 6 (SIGABRT), code -6 (SI_TKILL), fault addr --------
Abort message: 'art/runtime/indirect_reference_table.cc:132] JNI ERROR (app bug): weak global reference table overflow (max=51200)'
    r0 00000000  r1 000028a4  r2 00000006  r3 00000008
    r4 f300558c  r5 00000006  r6 f3005534  r7 0000010c
    r8 00000000  r9 0000000a  sl 00001785  fp f0385400
    ip 0000000b  sp ffde7b50  lr f1a065e7  pc f1a08e44  cpsr 200f0010

backtrace:
    #00 pc 00049e44  /system/lib/libc.so (tgkill+12)
    #01 pc 000475e3  /system/lib/libc.so (pthread_kill+34)
    #02 pc 0001d8a5  /system/lib/libc.so (raise+10)
    #03 pc 000193f1  /system/lib/libc.so (__libc_android_abort+34)
    #04 pc 00017034  /system/lib/libc.so (abort+4)
    #05 pc 0031a5f1  /system/lib/libart.so (_ZN3art7Runtime5AbortEPKc+328)
    #06 pc 000b5205  /system/lib/libart.so (_ZN3art10LogMessageD2Ev+1132)
    #07 pc 001bc42b  /system/lib/libart.so (_ZN3art22IndirectReferenceTable3AddEjPNS_6mirror6ObjectE+194)
    #08 pc 0023a097  /system/lib/libart.so (_ZN3art9JavaVMExt16AddWeakGlobalRefEPNS_6ThreadEPNS_6mirror6ObjectE+46)
    #09 pc 0027f483  /system/lib/libart.so (_ZN3art3JNI16NewWeakGlobalRefEP7_JNIEnvP8_jobject+418)
    #10 pc 0000de14  /data/app/com.x10host.pathos-2/lib/arm/libmonodroid.soapp/com.x10host.pathos-2/lib/arm/libmonodroid.so
Comment 1 Callan Hodgskin 2017-06-27 00:42:29 UTC
Created attachment 23124 [details]
ADB Crash Log 02

The attached log was traced by a user when a crash occurred.

Illustrates this error:

JNI ERROR (app bug): weak global reference table overflow (max=51200)
Comment 2 Callan Hodgskin 2017-06-27 00:44:34 UTC
Created attachment 23125 [details]
AD Crash Log 01

This is another crash log traced by the same user as Crash Log 02. The crash does not have the expected signature but I have included in case it gives useful context.
Comment 3 Callan Hodgskin 2017-06-27 00:51:13 UTC
Well great, I put the crash log comments the wrong way around and I can't find a way to edit. These are the correct descriptions:


ADB Crash Log 01

The attached log was traced by a user when a crash occurred. Illustrates this error: JNI ERROR (app bug): weak global reference table overflow (max=51200)

AD Crash Log 02

This is another crash log traced by the same user as Crash Log 02. The crash does not have the expected signature but I have included in case it gives useful context.
Comment 4 Jonathan Pryor 2017-06-27 17:22:59 UTC
Attachment #23125 [details] contains:

> JNI ERROR (app bug): weak global reference table overflow (max=51200)
> weak global reference table dump:
> ...
>   Summary:
>     15604 of android.graphics.Paint (15604 unique instances)
>     13122 of android.view.RenderNode (13122 unique instances)
>      7517 of md54b918eb979e7742efb4e28790b7a57be.AndroidLayoutBackground (7517 unique instances)

It's troubling that there are 7517 instances of `AndroidLayoutBackground`, as each instance contains *two* Paint instances (Paint, BorderPaint), among other instances. AndroidLayoutCanvas in turn contains the AndroidLayoutBackground instance, but we don't see nearly as many AndroidLayoutCanvas instances in the output.

The major problem is that this app is not GREF friendly:

https://developer.xamarin.com/guides/android/advanced_topics/garbage_collection/
https://developer.xamarin.com/guides/android/troubleshooting/troubleshooting/#GC_Messages

I would suggest working to *minimize* GREF use as much as possible. This could involve adding a GC.Collect() to Activity.OnDispose()...or something more significant.

It may be useful to grab the GREF logs to see where the instances are being created. To obtain the GREF logs, run:

> $ adb shell setprop debug.mono.log gref,gc

Run your app, and get it to crash.

Next, obtain the GREF log:

> adb shell run-as @PACKAGE_NAME@ cat files/.__override__/grefs.txt > g.txt


Note: @PACKAGE_NAME@ is the package name of your app.

Once you have `g.txt` and the error message from `adb logcat`, you can look for the +w+ messages to see where a Weak GREF is created.
Comment 5 Callan Hodgskin 2017-06-28 12:28:20 UTC
Hello Jonathon and thank you for your analysis and insights. Would you mind clarifying what you mean by 'the major problem is that this app is not GREF friendly'? Sounds like you are suggesting that I need to use GC.Collect manually to avoid the weak grefs from accumulating. Is this a crash a known issue in this way?

Regardless, I've restructured the AndroidLayoutBackground to share the paint instances (so there are only two now). I have also restructured to avoid creating the *ListenerImplementor instances.

I asked my user to obtain the GREF logs as you suggested. As you probably know, this is not possible unless the app is debuggable=true. Anyway, I set this field and gave the user a RELEASE .apk to sideload. The app was running much slower (as expected due to the logging) and he was no longer able to crash the app. Not sure if this means anything to you? I will attach the logs from his session anyway.

My preliminary research into the GREF logs found a curious pattern:

handle 0x101ba6; key_handle 0xe07e256: Java Type: `md56302800db530d1c459a801f5e3e263f4/AndroidLayoutBackground`; MCW type: `Inv.AndroidLayoutBackground`
*take_weak obj=0xc78ff5a0; handle=0x1019ea
+w+ grefc 1288 gwrefc 1 obj-handle 0x1019ea/G -> new-handle 0x200d83/W from thread 'finalizer'(12593)

It seems that the weak references are taken in the 'finalizer' thread. From my limited understanding and reading, I think this is as expected and normal?
Comment 6 Callan Hodgskin 2017-06-28 12:30:13 UTC
Created attachment 23170 [details]
Pathos Logs 03 (No Crash)

After turning debuggable=true on my RELEASE build, the app no longer crashes for my user. The 7zip file attached contains the session including the GREF log file.
Comment 7 Callan Hodgskin 2017-06-29 01:36:59 UTC
Okay, some more findings from the original Crash 01 and Crash 02 logs.


06-26 12:23:05.490 17156 17156 I monodroid-gc: 46080 outstanding GREFs. Performing a full GC!
06-26 12:23:06.432 17156 17156 F art     : art/runtime/indirect_reference_table.cc:132] JNI ERROR (app bug): weak global reference table overflow (max=51200)


I am exceeding the 46800 gref threshold so it performs a full GC. You can see in the timestamps that it spends nearly a full second performing the GC. The next line is the weak global reference table overflow. It is likely that the finalizers caused 5200 more weak reference objects to be made. This hits the 52,000 limit and then we see the line 'Runtime aborting...' and end up with a crashed app.

I think the users who are experiencing the problem are 'power users' and are driving the app in manner that leads to lots of gref garbage that needs to be cleaned up. Unfortunately, the mono runtime has a weakness and can't cope with this amount of garbage at once.

In defense of my code, people once thought that 640K was enough memory! :-) Is there anything that can be done to increase the 52,000 limit of gref?  Running slow in the GC is not unreasonable, but crashing hard like this is terrible.
Comment 8 Alex V 2017-07-31 20:30:51 UTC
Is there any new information about this issue?
Is it going to be fixed anytime soon?
Comment 9 Callan Hodgskin 2017-08-01 01:26:51 UTC
Hey Alex, I've written an article on this issue (and a different one affecting Xamarin.iOS). https://medium.com/@hodgskin.callan/flaws-in-xamarin-memory-management-aecbb5c45895

The salient points:
* Reduce the number transient Android objects.
* Call GC.Collect as much as you can.

This has been enough to mitigate the crashing I was having with my app. It would be nice to get a straight acknowledgement from Xamarin that this is an implementation flaw in the framework.

Note You need to log in before you can comment on or make changes to this bug.