Bug 13813 - Mono crashes when AppDomains are created in parallel (few)
Summary: Mono crashes when AppDomains are created in parallel (few)
Status: RESOLVED FIXED
Alias: None
Product: Runtime
Classification: Mono
Component: GC
Version: unspecified
Hardware: PC Linux
Importance: --- normal
Target Milestone: ---
Assignee: Vlad Brezae
URL:
Depends on:
Blocks:
 
Reported: 2013-08-07 09:24 UTC by Alexandre Faria
Modified: 2017-05-31 00:34 UTC

See Also:
Tags:
Is this bug a regression?: ---
Last known good build:


Attachments
Repro crash log showing GDB crash output (1.64 MB, text/plain)
2017-05-22 04:25 UTC, Ben Burns

Description Alexandre Faria 2013-08-07 09:24:29 UTC
Mono crashes when I try to create AppDomains in parallel. The exact number varies, but only a few are required to crash Mono.

Code:
using System;
using System.Threading;

public class A
{
        private static void ToDo(object stateInfo)
        {
                System.Console.WriteLine("\n\nIteration " + ((int)stateInfo));
                AppDomain ad = AppDomain.CreateDomain("ChildDomain");
                AppDomain.Unload(ad);
        }

        public static void Main(string[] args)
        {
                for(int i=0; i<4000; i++)
                {
                        ThreadPool.QueueUserWorkItem(new WaitCallback(ToDo), i);
                }
                Thread.Sleep(100000000);
        }
}

Error:
* Assertion: should not be reached at sgen-scan-object.h:111

Stacktrace:


Native stacktrace:

	mono() [0x4e3bb1]
	/lib/x86_64-linux-gnu/libpthread.so.0(+0xfbd0) [0x7f3c4f8bbbd0]
	/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37) [0x7f3c4f305037]
	/lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f3c4f308698]
	mono() [0x67385d]
	mono() [0x673996]
	mono() [0x61c931]
	mono() [0x612cb7]
	mono() [0x612ebf]
	mono() [0x61485a]
	mono() [0x615474]
	mono() [0x618ea7]
	mono(mono_gc_collect+0x28) [0x6190b8]
	mono(mono_domain_finalize+0x94) [0x5e7874]
	mono() [0x5deb9d]
	mono() [0x65f2b1]
	mono() [0x66f120]
	/lib/x86_64-linux-gnu/libpthread.so.0(+0x7f8e) [0x7f3c4f8b3f8e]
	/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f3c4f3c7e1d]

Debug info from gdb:

Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No threads.

=================================================================
Got a SIGABRT while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================
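
The "Could not attach to process" message above points at the kernel's Yama ptrace restriction, which is why no gdb output follows it. A minimal sketch for checking that setting, assuming a Linux system where Yama is enabled; the helper name `ptrace_scope_status` is made up for illustration, and the file path is the one named in the gdb message:

```shell
# Report the Yama ptrace restriction level.
# An alternative file can be passed as the first argument (useful for testing).
ptrace_scope_status() {
    scope_file=${1:-/proc/sys/kernel/yama/ptrace_scope}
    if [ ! -r "$scope_file" ]; then
        echo "unknown: cannot read $scope_file"
        return 0
    fi
    scope=$(cat "$scope_file")
    if [ "$scope" = "0" ]; then
        echo "ptrace_scope=0: processes with the same uid may attach"
    else
        echo "ptrace_scope=$scope: attaching is restricted"
    fi
}

ptrace_scope_status
```

If a restricted value is reported, `sudo sysctl -w kernel.yama.ptrace_scope=0` lifts the restriction until the next reboot, after which the crash handler should be able to attach gdb.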
Comment 1 Alexandre Faria 2013-08-07 09:33:50 UTC
Mono JIT compiler version 3.3.0 (master/b213fac Qua Ago  7 10:19:08 WEST 2013)
Comment 2 Rodrigo Kumpera 2013-08-23 14:33:13 UTC
Mark, the above test crashes us very reliably and fast.
Comment 3 Zoltan Varga 2013-09-05 11:25:44 UTC

*** This bug has been marked as a duplicate of bug 14339 ***
Comment 4 Alexandre Faria 2013-09-06 10:26:09 UTC
I just updated from git and it still crashes.

I assumed the fix was already in git; is that correct?

Mono JIT compiler version 3.2.3 (master/81d7fec Sex Set  6 12:17:34 WEST 2013)
Copyright (C) 2002-2012 Novell, Inc, Xamarin Inc and Contributors. www.mono-project.com
	TLS:           __thread
	SIGSEGV:       altstack
	Notifications: epoll
	Architecture:  amd64
	Disabled:      none
	Misc:          softdebug 
	LLVM:          yes(3.3svn-mono)
	GC:            sgen
Comment 5 Zoltan Varga 2013-09-06 10:47:08 UTC
Does it still crash the same way ?
Comment 6 Alexandre Faria 2013-09-06 11:01:12 UTC
Well, yes, but the assertion isn't present and it might take a bit longer to happen.

Stacktrace:


Native stacktrace:

	mono() [0x4e2b71]
	mono() [0x54885f]
	mono() [0x456677]
	/lib/x86_64-linux-gnu/libpthread.so.0(+0xfbd0) [0x7fcbbbfa4bd0]
	mono() [0x61b750]
	mono() [0x5c3d99]
	mono() [0x612167]
	mono() [0x613baa]
	mono() [0x6147c4]
	mono() [0x618207]
	mono(mono_gc_collect+0x28) [0x618418]
	mono() [0x5ddf8e]
	mono() [0x65e611]
	mono() [0x66e4e0]
	/lib/x86_64-linux-gnu/libpthread.so.0(+0x7f8e) [0x7fcbbbf9cf8e]
	/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fcbbbab0e1d]

Debug info from gdb:

Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No threads.

=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================
Comment 7 Mark Probst 2013-09-06 12:12:06 UTC
I can reproduce this.
Comment 8 Mark Probst 2013-09-06 14:06:11 UTC
Fixed in 0b296662aec9a33b27bd0a4ad6ee93f3fb3334c5.
Comment 9 Alexandre Faria 2013-09-06 18:55:08 UTC
Thanks, confirmed ;)
Comment 10 Alexandre Faria 2015-04-05 09:25:41 UTC
The same problem is back on more recent mono, crashing really fast and reliably...

Mono JIT compiler version 4.1.0 (master/199cc80 Dom Abr  5 11:19:29 WEST 2015)
Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors. www.mono-project.com
	TLS:           __thread
	SIGSEGV:       altstack
	Notifications: epoll
	Architecture:  amd64
	Disabled:      none
	Misc:          softdebug 
	LLVM:          supported, not enabled.
	GC:            sgen

The errors given vary a lot...
Comment 11 Mark Probst 2015-04-07 20:06:20 UTC
I can't reproduce this.  How often do you need to run it for it to crash?  Could you post crash logs?
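
For answering "how often" with a number, one option is a small loop that reruns the repro and keeps the logs of the failing runs. This is only a sketch: the function name `count_crashes`, the `run-N.log` file names, and `repro.exe` are illustrative, and it assumes the program exits on its own when it does not crash.

```shell
# count_crashes RUNS CMD [ARGS...]
# Runs CMD the given number of times, keeps run-N.log for every run that
# exits with a non-zero status, and prints a one-line summary.
count_crashes() {
    runs=$1
    shift
    failed=0
    i=1
    while [ "$i" -le "$runs" ]; do
        if "$@" > "run-$i.log" 2>&1; then
            rm -f "run-$i.log"          # clean run: discard its log
        else
            failed=$((failed + 1))      # failing run: keep its log
        fi
        i=$((i + 1))
    done
    echo "$failed of $runs runs failed"
}

# Hypothetical usage:
#   count_crashes 20 mono repro.exe
```

The kept `run-N.log` files are then the crash logs to attach.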
Comment 12 Alexandre Faria 2015-04-08 08:08:42 UTC
I believe this is an sgen bug; with Boehm it does not happen.

Every time I execute that code sample, it crashes after just a few seconds.

I just updated Mono to check if there was a recent fix, but it's still there.

I get this kind of bug with sgen in many more places, but this has been the easiest way to reproduce it reliably and quickly.

There is another code sample that is nicer and crashes faster (I used the original one for the sample crashes below):

using System;
using System.Threading;
using System.Threading.Tasks;

public class Example
{
  public static void Main()
  {
    Parallel.For(0, 10000, i =>
    {
      System.Console.WriteLine("\n\nIteration " + i);
      AppDomain ad = AppDomain.CreateDomain("ChildDomain");
      AppDomain.Unload(ad);
    });
  }
}



Here are some examples of crashes (taken from the original sample):

Example 1:

Unhandled Exception:
System.NullReferenceException: Object reference not set to an instance of an object
  at <0x00000 + 0x00000> <unknown method>
  at <0x00000 + 0x00000> <unknown method>
  at <0x00000 + 0x00000> <unknown method>
  at <0x00000 + 0x00000> <unknown method>
  at <0x00000 + 0x00000> <unknown method>
  at <0x00000 + 0x00000> <unknown method>
  at <0x00000 + 0x00000> <unknown method>
  at <0x00000 + 0x00000> <unknown method>
  at <0x00000 + 0x00000> <unknown method>
  at System.AppDomain.InvokeInDomain (System.AppDomain domain, System.Reflection.MethodInfo method, System.Object obj, System.Object[] args) <0x7f3d1bf57970 + 0x000a2> in <filename unknown>:0 
  at System.Runtime.Remoting.RemotingServices.GetDomainProxy (System.AppDomain domain) <0x7f3d1c035f80 + 0x00059> in <filename unknown>:0 
  at System.AppDomain.CreateDomain (System.String friendlyName, System.Security.Policy.Evidence securityInfo, System.AppDomainSetup info) <0x7f3d1bf57c30 + 0x00203> in <filename unknown>:0 
  at System.AppDomain.CreateDomain (System.String friendlyName) <0x7f3d1bf57bf0 + 0x00010> in <filename unknown>:0 
  at A.ToDo (System.Object stateInfo) <0x41ccbfb0 + 0x00095> in <filename unknown>:0 

Example 2:

* Assertion: should not be reached at sgen-scan-object.h:101

Stacktrace:


Native stacktrace:

	mono(mono_handle_native_sigsegv+0xc8) [0x4d08f8]
	/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7fd0877de340]
	/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7fd08743fcc9]
	/lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7fd0874430d8]
	mono() [0x660f59]
	mono(monoeg_g_logv+0x3f) [0x66115f]
	mono(monoeg_assertion_message+0x96) [0x6612a6]
	mono() [0x60ca27]
	mono() [0x60e53b]
	mono() [0x602b9c]
	mono() [0x603da5]
	mono() [0x604409]
	mono(sgen_perform_collection+0x4c8) [0x6076d8]
	mono(mono_gc_collect+0x28) [0x607fe8]
	mono() [0x5d2eda]
	mono() [0x65aa1e]
	/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182) [0x7fd0877d6182]
	/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fd08750347d]

Debug info from gdb:

Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No threads.

=================================================================
Got a SIGABRT while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================


Example 3:

Stacktrace:


Native stacktrace:

	mono(mono_handle_native_sigsegv+0xc8) [0x4d08f8]
	mono(mono_arch_handle_altstack_exception+0xbe) [0x526e5e]
	mono(mono_sigsegv_signal_handler+0xf8) [0x44fca8]
	/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7feeb41ec340]

Debug info from gdb:



Iteration 2870
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No threads.

=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================


Example 4:

Unhandled Exception:
System.NullReferenceException: Object reference not set to an instance of an object
  at System.Runtime.Remoting.Proxies.RemotingProxy.Invoke (IMessage request) <0x7ff68e02cda0 + 0x003fc> in <filename unknown>:0 
  at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke (System.Runtime.Remoting.Proxies.RealProxy rp, IMessage msg, System.Exception& exc, System.Object[]& out_args) <0x7ff68e02b640 + 0x00452> in <filename unknown>:0 


Example 5:

Stacktrace:


Native stacktrace:

Segmentation fault (core dumped)
Comment 13 Rodrigo Kumpera 2015-04-08 10:26:06 UTC
Hi Mark,

Please take a look at this report.
Comment 14 Mark Probst 2015-04-08 16:41:59 UTC
The problem is cross-appdomain references.  Running with `MONO_GC_DEBUG=xdomain-checks` I get lots of this:

xdomain reference in 0x7ffff5c8e790 (System.Runtime.Remoting.Messaging.CADMethodCallMessage) at offset 48 (_method) to 0x7ffff5c06af0 (System.Reflection.MonoMethod) ()  -  pointed to by:
found ref to 0x7ffff5c8e790 in object 0x7ffff5c922f8 (Object[]) at offset 40
xdomain reference in 0x7ffff5ca79a0 (System.Runtime.Remoting.Messaging.CADMethodCallMessage) at offset 48 (_method) to 0x7ffff5c0b4c8 (System.Reflection.MonoMethod) ()  -  pointed to by:
found ref to 0x7ffff5ca79a0 in object 0x7ffff5cacf88 (Object[]) at offset 40
xdomain reference in 0x7ffff5cb3630 (System.Runtime.Remoting.Messaging.CADMethodCallMessage) at offset 48 (_method) to 0x7ffff5c0a4c8 (System.Reflection.MonoMethod) ()  -  pointed to by:
found ref to 0x7ffff5cb3630 in object 0x7ffff5cb70c0 (Object[]) at offset 40
xdomain reference in 0x7ffff5ca79a0 (System.Runtime.Remoting.Messaging.CADMethodCallMessage) at offset 48 (_method) to 0x7ffff618c478 (System.Reflection.MonoMethod) ()  -  pointed to by:
found ref to 0x7ffff5ca79a0 in object 0x7ffff5cacf88 (Object[]) at offset 40
xdomain reference in 0x7ffff5cb3630 (System.Runtime.Remoting.Messaging.CADMethodCallMessage) at offset 48 (_method) to 0x7ffff5c0a4c8 (System.Reflection.MonoMethod) ()  -  pointed to by:
found ref to 0x7ffff5cb3630 in object 0x7ffff5cb70c0 (Object[]) at offset 40
xdomain reference in 0x7ffff5ca79a0 (System.Runtime.Remoting.Messaging.CADMethodCallMessage) at offset 48 (_method) to 0x7ffff618c478 (System.Reflection.MonoMethod) ()  -  pointed to by:
found ref to 0x7ffff5ca79a0 in object 0x7ffff5cacf88 (Object[]) at offset 40
xdomain reference in 0x7ffff5cb3630 (System.Runtime.Remoting.Messaging.CADMethodCallMessage) at offset 48 (_method) to 0x7ffff7f18770 (System.Reflection.MonoMethod) ()  -  pointed to by:
found ref to 0x7ffff5cb3630 in object 0x7ffff5cb70c0 (Object[]) at offset 40
xdomain reference in 0x7ffff5ca79a0 (System.Runtime.Remoting.Messaging.CADMethodCallMessage) at offset 48 (_method) to 0x7ffff618c478 (System.Reflection.MonoMethod) ()  -  pointed to by:
found ref to 0x7ffff5ca79a0 in object 0x7ffff5cacf88 (Object[]) at offset 40
xdomain reference in 0x7ffff5cb3630 (System.Runtime.Remoting.Messaging.CADMethodCallMessage) at offset 48 (_method) to 0x7ffff7f18770 (System.Reflection.MonoMethod) ()  -  pointed to by:
found ref to 0x7ffff5cb3630 in object 0x7ffff5cb70c0 (Object[]) at offset 40
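
Because the same few objects repeat, output like the above can be collapsed into one count per source field. A rough sketch, assuming the line format shown here stays the same; `summarize_xdomain` and `repro.exe` are illustrative names, not part of Mono:

```shell
# Read xdomain-checks output on stdin and print, for each distinct
# "SourceType.field -> TargetType" combination, how many times it occurred.
# Lines that are not "xdomain reference in ..." lines are ignored.
summarize_xdomain() {
    sed -n 's/^xdomain reference in 0x[0-9a-f]* (\([^)]*\)) at offset [0-9]* (\([^)]*\)) to 0x[0-9a-f]* (\([^)]*\)).*/\1.\2 -> \3/p' \
        | sort | uniq -c | sed 's/^ *//'
}

# Hypothetical usage:
#   MONO_GC_DEBUG=xdomain-checks mono repro.exe 2>&1 | summarize_xdomain
```

For the log above, every line would fold into a single `CADMethodCallMessage._method -> System.Reflection.MonoMethod` entry, which makes it easier to see that one field is responsible.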
Comment 15 Alexander Kyte 2015-04-10 14:41:50 UTC
Those x-domain references were fixed in one of the commits on the xunit-fixes PR thread. This should be fixed once they can be merged. I am still working out an issue that appeared when I rebased the commits onto the head of master after they went out of date.

- Alex
Comment 16 Matt Z 2015-06-11 03:37:02 UTC
I still see this, even with Alexander's xunit fixes that are now integrated:

* Assertion: should not be reached at sgen-scan-object.h:101
Stacktrace:
  at <unknown> <0xffffffff>
  at (wrapper managed-to-native) object.__icall_wrapper_mono_gc_alloc_vector (intptr,intptr,intptr) <0xffffffff>
  at (wrapper alloc) object.AllocVector (intptr,intptr) <0xffffffff>
  at System.Collections.Generic.HashSet`1.SetCapacity (int,bool) <0x0004e>
  at System.Collections.Generic.HashSet`1.IncreaseCapacity () <0x00043>
  at System.Collections.Generic.HashSet`1.AddIfNotPresent (T) <0x001df>
  at System.Collections.Generic.HashSet`1.Add (T) <0x00013>
  at Irony.Parsing.Construction.ParserStateData.AddItem (Irony.Parsing.Construction.LR0Item) <0x00047>
  at Irony.Parsing.Construction.ParserStateData.AddItem (Irony.Parsing.Construction.LR0Item) <0x0026f>
  at Irony.Parsing.Construction.ParserStateData.AddItem (Irony.Parsing.Construction.LR0Item) <0x0026f>
  at Irony.Parsing.Construction.ParserStateData.AddItem (Irony.Parsing.Construction.LR0Item) <0x0026f>
  at Irony.Parsing.Construction.ParserStateData.AddItem (Irony.Parsing.Construction.LR0Item) <0x0026f>
  at Irony.Parsing.Construction.ParserStateData.AddItem (Irony.Parsing.Construction.LR0Item) <0x0026f>
  at Irony.Parsing.Construction.ParserStateData..ctor (Irony.Parsing.ParserState,Irony.Parsing.Construction.LR0ItemSet) <0x00407>
  at Irony.Parsing.Construction.ParserDataBuilder.FindOrCreateState (Irony.Parsing.Construction.LR0ItemSet) <0x0012b>
  at Irony.Parsing.Construction.ParserDataBuilder.ExpandParserStateList (int) <0x00107>
  at Irony.Parsing.Construction.ParserDataBuilder.CreateParserStates () <0x00093>
  at Irony.Parsing.Construction.ParserDataBuilder.Build () <0x0006b>
  at Irony.Parsing.Construction.LanguageDataBuilder.Build () <0x001bf>
  at Irony.Parsing.LanguageData.ConstructAll () <0x00047>
  at Irony.Parsing.LanguageData..ctor (Irony.Parsing.Grammar) <0x001f7>
  at HP.Storage.Web.QueryFilterCriteriaBuilder..ctor (HP.Storage.Logging.ILog,HP.Storage.Web.IPropertyConverter) <0x000d3>
  at UnitTests.QueryNotExpressionGeneratesHqlNotExpression () <0x0006b>
  at (wrapper runtime-invoke) object.runtime_invoke_void__this__ (object,intptr,intptr,intptr) <0xffffffff>
  at <unknown> <0xffffffff>
  at (wrapper managed-to-native) System.Reflection.MonoMethod.InternalInvoke (System.Reflection.MonoMethod,object,object[],System.Exception&) <0xffffffff>
  at System.Reflection.MonoMethod.Invoke (object,System.Reflection.BindingFlags,System.Reflection.Binder,object[],System.Globalization.CultureInfo) <0x000f2>
  at System.Reflection.MethodBase.Invoke (object,object[]) <0x0002a>
  at Xunit.Sdk.Reflector/ReflectionMethodInfo.Invoke (object,object[]) <0x00117>
  at Xunit.Sdk.FactCommand.Execute (object) <0x00034>
  at Xunit.Sdk.FixtureCommand.Execute (object) <0x001d4>
  at Xunit.Sdk.BeforeAfterCommand.Execute (object) <0x00190>
  at Xunit.Sdk.LifetimeCommand.Execute (object) <0x000d6>
  at Xunit.Sdk.ExceptionAndOutputCaptureCommand.Execute (object) <0x004bc>
  at Xunit.Sdk.TimedCommand.Execute (object) <0x0006b>
  at Xunit.Sdk.TestClassCommandRunner.Execute (Xunit.Sdk.ITestClassCommand,System.Collections.Generic.List`1<Xunit.Sdk.IMethodInfo>,System.Predicate`1<Xunit.Sdk.ITestCommand>,System.Predicate`1<Xunit.Sdk.ITestResult>) <0x00440>
  at Xunit.Sdk.Executor/RunTests/<>c__DisplayClass12.<.ctor>b__f () <0x00137>
  at Xunit.Sdk.Executor.ThreadRunner (object) <0x00070>
  at System.Threading.Thread.StartInternal () <0x0009d>
  at (wrapper runtime-invoke) object.runtime_invoke_void__this__ (object,intptr,intptr,intptr) <0xffffffff>
Comment 17 Alexander Kyte 2015-06-11 17:15:40 UTC
It was an issue that wasn't directly related to parallel appdomains, but they create enough memory pressure to expose a use-after-free bug.

https://github.com/mono/mono/pull/1869
Comment 18 Igor Kiselev 2015-12-04 02:46:09 UTC
I hit an issue with the same stack trace as the first one here when running the JSIL tests on Travis-CI (https://s3.amazonaws.com/archive.travis-ci.org/jobs/77761849/log.txt)

With mono 4.2.1.102-0xamarin1 I still get an error, but now with a different stack trace (https://s3.amazonaws.com/archive.travis-ci.org/jobs/94786977/log.txt):

Stacktrace:
Native stacktrace:
	mono() [0x49cf0c]
	mono() [0x4f2d5e]
	mono() [0x4249dd]
	/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f3617b97cb0]
	mono(mono_metadata_type_hash+0x1) [0x55ca81]
	mono() [0x55ca4f]
	mono(mono_metadata_type_hash+0xb7) [0x55cb37]
	mono() [0x55cbe4]
	mono() [0x55cc26]
	mono() [0x55cc5d]
	mono() [0x62711e]
	mono() [0x627325]
	mono(mono_metadata_get_inflated_signature+0x12a) [0x560e3a]
	mono() [0x543466]
	mono() [0x45d672]
	mono() [0x4fd622]
	mono() [0x4fe93f]
	mono() [0x423a14]
	mono() [0x42408b]
	mono() [0x49f4b1]
	[0x4034717d]
Debug info from gdb:
=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================

We were advised to install the mono-runtime-dbg package. I probably don't know how to use it properly, but the stack trace with it installed is exactly the same as without it.
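
Installing a debug package does not appear to change the addresses-only trace that the runtime prints; the raw addresses still have to be resolved afterwards against a binary that carries symbols. A minimal sketch of the first step, pulling the `mono` frame addresses out of a pasted trace. The function name `extract_mono_addrs`, the file `crash.log`, and the path to a symbol-bearing binary are all assumptions for illustration:

```shell
# Read a "Native stacktrace:" section on stdin and print the address of
# every frame that belongs to the mono binary itself, one per line.
# Frames in shared libraries and JIT frames are skipped on purpose.
extract_mono_addrs() {
    sed -n 's/^[[:space:]]*mono([^)]*) \[\(0x[0-9a-f]*\)][[:space:]]*$/\1/p'
}

# Hypothetical usage, resolving the addresses with binutils addr2line:
#   extract_mono_addrs < crash.log | addr2line -f -e /path/to/mono-with-symbols
```

The addresses are only meaningful against the exact build that produced the crash, so the binary with symbols has to match the installed runtime version.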
Comment 19 Rodrigo Kumpera 2015-12-04 14:30:27 UTC
Hi Igor,

How can I reproduce the JSIL crashes?
Comment 20 Igor Kiselev 2015-12-04 16:56:17 UTC
Unfortunately I can't provide small, isolated steps to reproduce the issue, only complex ones.

We use the following script to run the test cases on Travis-CI: https://github.com/sq/JSIL/blob/master/.travis.yml

You first need to fetch JSIL with its subrepos from GitHub. The build also depends on NodeJS (we export its path in the NODEJS environment variable).

You need to define the following environment variables:
export JsilUseAppDomainsInTest=true
export TestRun=JSIL.Tests.DeadCodeEliminationTest,JSIL.Tests.APITests,JSIL.Tests.AnalysisTests,JSIL.Tests.ComparisonTests,JSIL.Tests.ConfigurationTests,JSIL.Tests.DependencyTests,JSIL.Tests.FailingTests,JSIL.Tests.FormattingTests,JSIL.Tests.GenericsTests,JSIL.Tests.MetadataTests,JSIL.Tests.PerformanceTests,JSIL.Tests.TypeInformationTests,JSIL.Tests.UnsafeTests,JSIL.Tests.VerbatimTests,JSIL.Tests.XMLTests,JSIL.Tests.ThreadingTests

Then run the test cases with the following command:
mono ./packages/NUnit.Runners.2.6.4/tools/nunit-console.exe ./bin/Tests.DCE.dll ./bin/SimpleTests.dll ./bin/Tests.dll --run:$TestRun  --exclude:FailsOnMono\|\(FailsOnMonoWhenStubbed+Stubbed\)\|\(FailsOnMonoWhenStubbed+Translated\)

We see the error about once every 3 runs of the test cases.

I suspect the error occurs in the Dispose method of JSIL.Tests.ComparisonTest (https://github.com/sq/JSIL/blob/f8848c1b09ad30fc1e5045d689a76e7d57b8db04/Tests/ComparisonTest.cs#L250):
        public void Dispose () {
            if (Evaluator != null)
                Evaluator.Dispose();

            if (AssemblyAppDomain != AppDomain.CurrentDomain) {
                var unloadSignal = new ManualResetEventSlim(false);
                ThreadPool.QueueUserWorkItem((_) => {
                    AppDomain.Unload(AssemblyAppDomain);
                    unloadSignal.Set();
                });

                unloadSignal.Wait(5000);
                if (!unloadSignal.IsSet)
                    throw new ThreadStateException("Timed out in AppDomain.Unload for test " + this.OutputPath);
            }
Comment 21 Igor Kiselev 2015-12-04 17:07:13 UTC
I will try to extract all the AppDomain-related logic into a small app and reproduce the issue there.
I will report back if I succeed.
Comment 22 Rodrigo Kumpera 2015-12-04 20:46:36 UTC
Hi Igor,

It is very hard to build a small test case for this sort of failure, so I wouldn't worry about that.

For now it's enough to have simple instructions on how to reproduce it. Reliably failing
is more important than a smaller test case.

I think we got enough to try to reproduce it ourselves.
Comment 23 Ben Burns 2017-05-19 02:35:31 UTC
I'm still seeing this issue in 4.8.0 and 5.0.0.

Native stack trace from 4.8.0:

Native stacktrace:
        mono() [0x4a77ca]
        /lib/x86_64-linux-gnu/libpthread.so.0(+0xf0a0) [0x7f9ee8fb00a0]
        /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f9ee8a30125]
        /lib/x86_64-linux-gnu/libc.so.6(abort+0x180) [0x7f9ee8a333a0]
        mono() [0x640879]
        mono() [0x640a87]
        mono() [0x640bd6]
        mono() [0x5ffeb0]
        mono() [0x5f21d9]
        mono() [0x5f3224]
        mono() [0x5f35a3]
        mono() [0x5f5852]
        mono() [0x5e9535]
        mono() [0x5d7459]
        [0x41eb2c8e]

This problem is _very_ intermittent, but based on the symbols I see elsewhere in this thread and in other similar issues, I'd have to imagine there's some sort of race happening in sgen-scan-object.h around when signals are handled -- perhaps caused by another thread zeroing out the "desc" variable?

When we see this problem mono does not shut down until we send a SIGKILL, which also leads me to think it's some race during signal handling.
Comment 24 Ben Burns 2017-05-19 02:38:17 UTC
Since it doesn't seem to be mentioned elsewhere here, https://bugzilla.xamarin.com/show_bug.cgi?id=38941 seems to be a similar case.
Comment 25 Ben Burns 2017-05-19 03:02:02 UTC
The following repro from up thread does appear to crash for me on 5.0.0.100 (tested using the docker mono:5.0.0.100 image), but so far this hasn't produced the same sort of crash, and in both this crash and the one I mention below, mono does exit rather than hang.

using System;
using System.Threading;
using System.Threading.Tasks;

public class Example
{
  public static void Main()
  {
    Parallel.For(0, 10000, i =>
    {
      System.Console.WriteLine("\n\nIteration " + i);
      AppDomain ad = AppDomain.CreateDomain("ChildDomain");
      AppDomain.Unload(ad);
    });
  }
}

Error observed:
Stacktrace:

  at <unknown> <0xffffffff>
  at (wrapper managed-to-native) System.AppDomain.createDomain (string,System.AppDomainSetup) [0x0000b] in <4dc8ec68b0964e099af86e50301f5f3c>:0
  at System.AppDomain.CreateDomain (string,System.Security.Policy.Evidence,System.AppDomainSetup) [0x000bd] in <4dc8ec68b0964e099af86e50301f5f3c>:0
  at System.AppDomain.CreateDomain (string) [0x00000] in <4dc8ec68b0964e099af86e50301f5f3c>:0
  at Example.<Main>m__0 (int) [0x00015] in <2a8797dd639a48f7a1c58c2517d54de6>:0
  at System.Threading.Tasks.Parallel/<>c__DisplayClass17_0`1<TLocal_REF>.<ForWorker>b__1 () [0x000cb] in <4dc8ec68b0964e099af86e50301f5f3c>:0
  at System.Threading.Tasks.Task.InnerInvoke () [0x0000f] in <4dc8ec68b0964e099af86e50301f5f3c>:0
  at System.Threading.Tasks.Task.InnerInvokeWithArg (System.Threading.Tasks.Task) [0x00000] in <4dc8ec68b0964e099af86e50301f5f3c>:0
  at System.Threading.Tasks.Task/<>c__DisplayClass176_0.<ExecuteSelfReplicating>b__0 (object) [0x00086] in <4dc8ec68b0964e099af86e50301f5f3c>:0
  at System.Threading.Tasks.Task.InnerInvoke () [0x00025] in <4dc8ec68b0964e099af86e50301f5f3c>:0
  at System.Threading.Tasks.Task.Execute () [0x00010] in <4dc8ec68b0964e099af86e50301f5f3c>:0
  at System.Threading.Tasks.Task.ExecutionContextCallback (object) [0x00000] in <4dc8ec68b0964e099af86e50301f5f3c>:0
  at System.Threading.ExecutionContext.RunInternal (System.Threading.ExecutionContext,System.Threading.ContextCallback,object,bool) [0x00071] in <4dc8ec68b0964e099af86e50301f5f3c>:0
  at System.Threading.ExecutionContext.Run (System.Threading.ExecutionContext,System.Threading.ContextCallback,object,bool) [0x00000] in <4dc8ec68b0964e099af86e50301f5f3c>:0
  at System.Threading.Tasks.Task.ExecuteWithThreadLocal (System.Threading.Tasks.Task&) [0x00050] in <4dc8ec68b0964e099af86e50301f5f3c>:0
  at System.Threading.Tasks.Task.ExecuteEntry (bool) [0x00058] in <4dc8ec68b0964e099af86e50301f5f3c>:0
  at System.Threading.Tasks.Task.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem () [0x00000] in <4dc8ec68b0964e099af86e50301f5f3c>:0
  at System.Threading.ThreadPoolWorkQueue.Dispatch () [0x00074] in <4dc8ec68b0964e099af86e50301f5f3c>:0
  at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback () [0x00000] in <4dc8ec68b0964e099af86e50301f5f3c>:0
  at (wrapper runtime-invoke) <Module>.runtime_invoke_bool (object,intptr,intptr,intptr) [0x0001e] in <4dc8ec68b0964e099af86e50301f5f3c>:0
/proc/self/maps:
<< snip >>


The repro from https://bugzilla.xamarin.com/show_bug.cgi?id=14339 also occasionally throws the following:

```System.ApplicationException: Exception in TestRunnerThread ---> System.InvalidOperationException: TestContext: too many Restores
  at NUnit.Core.TestExecutionContext.ReverseChanges () [0x00011] in <573414579ab64f38a2e50dcf72455b60>:0 
  at NUnit.Core.TestExecutionContext.Restore () [0x00001] in <573414579ab64f38a2e50dcf72455b60>:0 
  at NUnit.Core.TestSuite.RunSuiteInContext (NUnit.Core.EventListener listener, NUnit.Core.ITestFilter filter) [0x0002b] in <573414579ab64f38a2e50dcf72455b60>:0 
  at NUnit.Core.TestSuite.Run (NUnit.Core.EventListener listener, NUnit.Core.ITestFilter filter) [0x00039] in <573414579ab64f38a2e50dcf72455b60>:0 
  at NUnit.Core.SimpleTestRunner.Run (NUnit.Core.EventListener listener, NUnit.Core.ITestFilter filter, System.Boolean tracing, NUnit.Core.LoggingThreshold logLevel) [0x00083] in <573414579ab64f38a2e50dcf72455b60>:0 
  at NUnit.Core.TestRunnerThread.TestRunnerThreadProc () [0x00002] in <573414579ab64f38a2e50dcf72455b60>:0 
   --- End of inner exception stack trace ---
  at NUnit.Core.TestRunnerThread.TestRunnerThreadProc () [0x00053] in <573414579ab64f38a2e50dcf72455b60>:0 
  at System.Threading.ThreadHelper.ThreadStart_Context (System.Object state) [0x00014] in <4dc8ec68b0964e099af86e50301f5f3c>:0 
  at System.Threading.ExecutionContext.RunInternal (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state, System.Boolean preserveSyncCtx) [0x00071] in <4dc8ec68b0964e099af86e50301f5f3c>:0 
  at System.Threading.ExecutionContext.Run (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state, System.Boolean preserveSyncCtx) [0x00000] in <4dc8ec68b0964e099af86e50301f5f3c>:0 
  at System.Threading.ExecutionContext.Run (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state) [0x0002b] in <4dc8ec68b0964e099af86e50301f5f3c>:0 
  at System.Threading.ThreadHelper.ThreadStart () [0x00008] in <4dc8ec68b0964e099af86e50301f5f3c>:0```
Comment 26 Ben Burns 2017-05-19 03:03:44 UTC
Native stacktrace from first crash above, though I'd imagine it's likely not helpful?

Native stacktrace:

        mono() [0x4ad98a]
        mono() [0x5132f6]
        mono() [0x5c0b0b]
        [0x18bab50]
Comment 27 Ben Burns 2017-05-19 04:32:55 UTC
Ran the same repro code I mentioned above on 4.8.0.254 and it crashed after a long while.

There was no managed stack trace to speak of.

Native stacktrace:

        mono() [0x4b1faf]
        mono() [0x50c2fe]
        mono() [0x427507]
        /lib/x86_64-linux-gnu/libpthread.so.0(+0xf0a0) [0x7efcccaad0a0]
        mono() [0x5a3f83]
        mono() [0x666a3c]
        mono() [0x5a9545]
        mono() [0x5d2516]
        mono(mono_class_vtable+0xd) [0x5d29dd]
        mono() [0x5d5134]
        mono() [0x5c4e15]
        mono() [0x5c59bb]
        mono() [0x5c5de0]
        [0x418f0818]
Comment 28 Ben Burns 2017-05-19 04:33:12 UTC
Correction - 4.8.0.524
Comment 29 Rodrigo Kumpera 2017-05-19 21:09:39 UTC
Hey Vlad,

This looks like an sgen crash. Can you try to repro it?

I tried with 5.2 and master on OSX and Linux and it didn't crash.

Ben, can you try to update and retry? Additionally, can you install debug symbols, as otherwise we have no way to know what's going on.
Comment 30 Ben Burns 2017-05-22 03:09:07 UTC
Assuming you're referring to the mono-dbg package in the Debian Xamarin repo (e.g. for 4.8.0.524 http://download.mono-project.com/repo/debian/dists/wheezy/snapshots/4.8.0.524/main), I installed this into the official mono:4.8.0.524 image and the exception was still not symbolicated.

/usr/lib/libmonosgen-2.0.so.1.0.0 and /usr/bin/mono-sgen appear to be stripped binaries per the file command.  Will give 5.0 a try.
Comment 31 Ben Burns 2017-05-22 03:12:03 UTC
Ah, I meant mono-runtime-dbg, but dpkg -L mono-runtime-dbg appears to have saved me. I will try running using /usr/lib/debug/usr/bin/mono-sgen
Comment 32 Ben Burns 2017-05-22 04:23:33 UTC
I'm having trouble getting a symbolicated native stacktrace but I am getting GDB output now. Will attach the full GDB output, but I'm afraid that it's rather large as there are quite a few active threads at the time of crash.

It appears that pretty much everything is waiting on locks in various ways, apart from the one worker thread which caught a SIGSEGV. I think that this may be a red herring though, as IIRC, any thread can be interrupted by the signal handler?

Also worth pointing out, I'm not seeing the "* Assertion: should not be reached at sgen-scan-object.h:LINE_NO" error here, so I'm a bit concerned this is different from the crash I'm seeing in my production service.

Thread 9 (Thread 0x7f4408a5d700 (LWP 2208)):
#0  0x00007f448c2f9c8d in waitpid () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00000000004b2042 in mono_handle_native_sigsegv (signal=signal@entry=11, ctx=ctx@entry=0x7f443b15cac0, info=info@entry=0x7f443b15cbf0) at mini-exceptions.c:2469
#2  0x000000000050c2fe in mono_arch_handle_altstack_exception (sigctx=sigctx@entry=0x7f443b15cac0, siginfo=siginfo@entry=0x7f443b15cbf0, fault_addr=<optimized out>, stack_ovf=stack_ovf@entry=0) at exceptions-amd64.c:795
#3  0x0000000000427507 in mono_sigsegv_signal_handler (_dummy=11, _info=0x7f443b15cbf0, context=0x7f443b15cac0) at mini-runtime.c:2896
#4  <signal handler called>
#5  0x00000000005a3f83 in alloc_context_static_data_helper (key=<optimized out>, value=<optimized out>, user=0x80022a01) at threads.c:4139
#6  0x0000000000666a3c in monoeg_g_hash_table_foreach (hash=0x1afb390, func=func@entry=0x5a3f50 <alloc_context_static_data_helper>, user_data=user_data@entry=0x80022a01) at ghashtable.c:354
#7  0x00000000005a9545 in mono_alloc_special_static_data (static_type=static_type@entry=2, size=<optimized out>, align=<optimized out>, bitmap=bitmap@entry=0x7f4408a5c020, numbits=1) at threads.c:4231
#8  0x00000000005d2516 in mono_class_create_runtime_vtable (error=0x7f4408a5c080, klass="System.Runtime.Remoting.Contexts.Context", domain=0x7f43f9593120) at object.c:2016
#9  mono_class_vtable_full (domain=0x7f43f9593120, klass="System.Runtime.Remoting.Contexts.Context", error=0x7f4408a5c080) at object.c:1787
#10 0x00000000005d29dd in mono_class_vtable (domain=domain@entry=0x7f43f9593120, klass=klass@entry="System.Runtime.Remoting.Contexts.Context") at object.c:1753
#11 0x00000000005d5134 in mono_object_new_pinned (domain=domain@entry=0x7f43f9593120, klass="System.Runtime.Remoting.Contexts.Context", error=error@entry=0x7f4408a5c190) at object.c:5204
#12 0x00000000005c4e15 in mono_context_init_checked (domain=domain@entry=0x7f43f9593120, error=error@entry=0x7f4408a5c190) at appdomain.c:380
#13 0x00000000005c59bb in mono_domain_create_appdomain_internal (friendly_name=friendly_name@entry=0x7f43f91b2550 "ChildDomain", setup=setup@entry=0x7f448b2f5ae8, error=error@entry=0x7f4408a5c190) at appdomain.c:568
#14 0x00000000005c5de0 in ves_icall_System_AppDomain_createDomain (friendly_name=<optimized out>, setup=0x7f448b2f5ae8) at appdomain.c:982
#15 0x0000000040fe6d78 in ?? ()
#16 0x00007f4430157370 in ?? ()
#17 0x00007f448ccf8130 in ?? ()
#18 0x00007f448cce03d0 in ?? ()
#19 0x0000000000000000 in ?? ()
Comment 33 Ben Burns 2017-05-22 04:25:17 UTC
Created attachment 22340
Repro crash log showing GDB crash output
Comment 34 Ben Burns 2017-05-22 04:38:11 UTC
Shot in the dark: is this perhaps an unchecked allocation failure?


From the signaled stack frame above:

static void
alloc_context_static_data_helper (...) {
    ...
    mono_alloc_static_data (&ctx->static_data, offset, FALSE);
    ctx->data->static_data = ctx->static_data; // signaled on this line
}
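
To make the guess concrete: one way the marked line could fault is if ctx->data were still NULL when the helper runs. That is an assumption for illustration, not a diagnosis of the actual bug. The sketch below uses hypothetical stand-in types (demo_app_context, demo_internal_context) and a hypothetical function name; it is not Mono's code, only the shape of a guarded store.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the runtime's context structures. */
typedef struct {
    void **static_data;
} demo_internal_context;

typedef struct {
    void **static_data;
    demo_internal_context *data;  /* assumed: may be NULL during setup */
} demo_app_context;

/* Publishes ctx->static_data into ctx->data, but only if ctx->data exists.
 * Returns 1 if the store happened, 0 if it was skipped. */
int
demo_publish_static_data (demo_app_context *ctx)
{
    assert (ctx != NULL);         /* a NULL ctx would be a caller bug */
    if (ctx->data == NULL)
        return 0;                 /* the unguarded store would fault here */
    ctx->data->static_data = ctx->static_data;
    return 1;
}
```

A guard like this would only hide the symptom; if the pointer really is NULL at that point, the more interesting question is which thread was supposed to have set it.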
Comment 35 Ben Burns 2017-05-22 05:30:50 UTC
I managed to symbolicate the Native stacktrace from the crash which occurred in my production system which did trigger the sgen assertion error. Unfortunately we're running 4.8.0.495 there, although per my initial comment another team has observed the same assertion failure on 5.0.0.100.

From prod service running on 4.8.0.495:

Native stacktrace:
        mono() [0x4a77ca]
        /lib/x86_64-linux-gnu/libpthread.so.0(+0xf0a0) [0x7fd9d24830a0]
        /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7fd9d1f03125]
        /lib/x86_64-linux-gnu/libc.so.6(abort+0x180) [0x7fd9d1f063a0]
        mono() [0x640879]
        mono() [0x640a87]
        mono() [0x640b32]
        mono() [0x5f8b76]
        mono() [0x5fb220]
        mono() [0x605f42]
        mono() [0x606bfa]
        mono() [0x5fad43]
        mono() [0x5e9df6]
        mono() [0x5f11dd]
        mono() [0x609e28]
        mono() [0x5f3807]
        mono() [0x5f597d]
        mono() [0x5e9535]
        mono() [0x5d7459]

Which translates to the following, per addr2line --pretty-print -fe /usr/lib/debug/usr/bin/mono-sgen 0x4a77ca 0x640879 0x640a87 0x640bd6 0x5ffeb0 0x5f21d9 0x5f3224 0x5f35a3 0x5f5852 0x5e9535 0x5d7459

load_function_full at /tmp/buildd/mono-4.8.0.495/mono/mini/aot-runtime.c:4933
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf0a0) [0x7fd9d24830a0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7fd9d1f03125]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x180) [0x7fd9d1f063a0]
is_pid_valid at /tmp/buildd/mono-4.8.0.495/mono/io-layer/processes.c:198
utf16_concat at /tmp/buildd/mono-4.8.0.495/mono/io-layer/processes.c:251
load_modules at /tmp/buildd/mono-4.8.0.495/mono/io-layer/processes.c:1626
D32AddCarry at /tmp/buildd/mono-4.8.0.495/mono/metadata/decimal-ms.c:2131
_mono_reflection_parse_type at /tmp/buildd/mono-4.8.0.495/mono/metadata/reflection.c:1515
mono_property_get_object_checked at /tmp/buildd/mono-4.8.0.495/mono/metadata/reflection.c:750
mono_get_dbnull_object at /tmp/buildd/mono-4.8.0.495/mono/metadata/reflection.c:1187
mono_method_get_object_checked at /tmp/buildd/mono-4.8.0.495/mono/metadata/reflection.c:608
fixup_method at /tmp/buildd/mono-4.8.0.495/mono/metadata/sre-save.c:1745
mono_array_full_copy at /tmp/buildd/mono-4.8.0.495/mono/metadata/object.c:5576
Comment 36 Ben Burns 2017-05-22 05:50:12 UTC
Argh, that hand-symbolicated stack trace was nonsensical because we apparently run 4.4.0 in prod :-(

To avoid further confusion I grabbed this one straight from one of our prod containers...

mono_handle_native_sigsegv at /tmp/buildd/mono-4.4.0.182/mono/mini/mini-exceptions.c:2310
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf0a0) [0x7fd9d24830a0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7fd9d1f03125]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x180) [0x7fd9d1f063a0]
monoeg_log_default_handler at /tmp/buildd/mono-4.4.0.182/eglib/src/goutput.c:233
monoeg_g_logv at /tmp/buildd/mono-4.4.0.182/eglib/src/goutput.c:114
monoeg_assertion_message at /tmp/buildd/mono-4.4.0.182/eglib/src/goutput.c:135
major_scan_object_no_evacuation at /tmp/buildd/mono-4.4.0.182/mono/sgen/sgen-scan-object.h:82
finish_gray_stack at /tmp/buildd/mono-4.4.0.182/mono/sgen/sgen-gc.c:1151
major_finish_collection at /tmp/buildd/mono-4.4.0.182/mono/sgen/sgen-gc.c:1925
major_do_collection at /tmp/buildd/mono-4.4.0.182/mono/sgen/sgen-gc.c:2051
sgen_perform_collection at /tmp/buildd/mono-4.4.0.182/mono/sgen/sgen-gc.c:2266
sgen_alloc_obj_nolock at /tmp/buildd/mono-4.4.0.182/mono/sgen/sgen-alloc.c:292
mono_gc_alloc_vector at /tmp/buildd/mono-4.4.0.182/mono/metadata/sgen-mono.c:1736
Comment 37 Ben Burns 2017-05-22 06:20:24 UTC
I also managed to dig up the 5.0.0 crash which came w/ the same sgen assertion:

* Assertion: should not be reached at sgen-scan-object.h:90

Native stacktrace:

        /opt/mono-5.0.0/bin/mono() [0x4abadd]
        /lib64/libpthread.so.0(+0xf370) [0x7fe8ce830370]
        /lib64/libc.so.6(gsignal+0x37) [0x7fe8ce27f1d7]
        /lib64/libc.so.6(abort+0x148) [0x7fe8ce2808c8]
        /opt/mono-5.0.0/bin/mono() [0x6631d4]
        /opt/mono-5.0.0/bin/mono() [0x67841c]
        /opt/mono-5.0.0/bin/mono() [0x67857d]
        /opt/mono-5.0.0/bin/mono() [0x64d092]
        /opt/mono-5.0.0/bin/mono() [0x65a9d1]
        /opt/mono-5.0.0/bin/mono() [0x659963]
        /lib64/libpthread.so.0(+0x7dc5) [0x7fe8ce828dc5]
        /lib64/libc.so.6(clone+0x6d) [0x7fe8ce34173d]

Manually Symbolicated version:

mono_handle_native_crash at /usr/src/debug/mono-5.0.0/mono/mini/mini-exceptions.c:2520
        /lib64/libpthread.so.0(+0xf370) [0x7fe8ce830370]
        /lib64/libc.so.6(gsignal+0x37) [0x7fe8ce27f1d7]
        /lib64/libc.so.6(abort+0x148) [0x7fe8ce2808c8]
mono_log_write_logfile at /usr/src/debug/mono-5.0.0/mono/utils/mono-log-common.c:137
monoeg_g_logv at /usr/src/debug/mono-5.0.0/eglib/src/goutput.c:116
monoeg_assertion_message at /usr/src/debug/mono-5.0.0/eglib/src/goutput.c:137
major_scan_object_no_evacuation at /usr/src/debug/mono-5.0.0/mono/sgen/sgen-scan-object.h:90
marker_idle_func at /usr/src/debug/mono-5.0.0/mono/sgen/sgen-workers.c:330
continue_idle_job at /usr/src/debug/mono-5.0.0/mono/sgen/sgen-thread-pool.c:87 (discriminator 1)
        /lib64/libpthread.so.0(+0x7dc5) [0x7fe8ce828dc5]
        /lib64/libc.so.6(clone+0x6d) [0x7fe8ce34173d]

And for completeness, the same code which generated that crash also generated this one:

Managed stacktrace:
  at <unknown> <0xffffffff>
  at (wrapper managed-to-native) object.__icall_wrapper_mono_gc_alloc_vector (intptr,intptr,intptr) [0x00000] in <8e4c7b80ba0942cb8aa6c8f9f3e5b12d>:0
  at (wrapper alloc) object.AllocVector (intptr,intptr) <0x00173>
  at System.MulticastDelegate.CombineImpl (System.Delegate) [0x0008a] in <8e4c7b80ba0942cb8aa6c8f9f3e5b12d>:0
  at System.Delegate.Combine (System.Delegate,System.Delegate) [0x0004f] in <8e4c7b80ba0942cb8aa6c8f9f3e5b12d>:0
  at snipped_private_application_code (System.EventHandler) [0x00009] in <fdbbe40f91de4db0afa6c9311cd8c822>:0
  at snipped_private_application_code (snipped_private_application_type4) [0x00048] in <fdbbe40f91de4db0afa6c9311cd8c822>:0
  at snipped_private_application_code (System.Transactions.Enlistment) [0x00014] in <fdbbe40f91de4db0afa6c9311cd8c822>:0
  at System.Transactions.Transaction.DoCommitPhase () [0x0001a] in <29dd66c2bf9a43179a8e55be8dbb00e9>:0
  at System.Transactions.Transaction.DoCommit () [0x00095] in <29dd66c2bf9a43179a8e55be8dbb00e9>:0
  at System.Transactions.Transaction.CommitInternal () [0x00022] in <29dd66c2bf9a43179a8e55be8dbb00e9>:0
  at System.Transactions.CommittableTransaction.Commit () [0x00000] in <29dd66c2bf9a43179a8e55be8dbb00e9>:0
  at snipped_private_application_code () [0x0000f] in <fdbbe40f91de4db0afa6c9311cd8c822>:0
  at snipped_private_application_code () [0x00000] in <fdbbe40f91de4db0afa6c9311cd8c822>:0
  at snipped_private_application_code (System.Action,string,System.Collections.Generic.IDictionary`2<string, object>) [0x00006] in <fdbbe40f91de4db0afa6c9311cd8c822>:0
  at snipped_private_application_code (snipped_private_application_type3) [0x002c1] in <fdbbe40f91de4db0afa6c9311cd8c822>:0
  at snipped_private_application_code (snipped_private_application_type2,bool,int) [0x000f1] in <fdbbe40f91de4db0afa6c9311cd8c822>:0
  at snipped_private_application_code (System.Collections.Concurrent.BlockingCollection`1<snipped_private_application_type1>[]) [0x00039] in <fdbbe40f91de4db0afa6c9311cd8c822>:0
  at snipped_private_application_code () [0x00000] in <fdbbe40f91de4db0afa6c9311cd8c822>:0
  at System.Threading.ThreadHelper.ThreadStart () [0x0001a] in <8e4c7b80ba0942cb8aa6c8f9f3e5b12d>:0
  at (wrapper runtime-invoke) object.runtime_invoke_void__this__ (object,intptr,intptr,intptr) [0x0004d] in <8e4c7b80ba0942cb8aa6c8f9f3e5b12d>:0

Native stacktrace:

        /opt/mono-5.0.0/bin/mono() [0x4abadd]
        /opt/mono-5.0.0/bin/mono() [0x50f7fe]
        /opt/mono-5.0.0/bin/mono() [0x63ac54]
        [0x6fb3e90]

Manually symbolicated version:

mono_handle_native_crash at /usr/src/debug/mono-5.0.0/mono/mini/mini-exceptions.c:2520
altstack_handle_and_restore at /usr/src/debug/mono-5.0.0/mono/mini/exceptions-amd64.c:780
major_copy_or_mark_object_concurrent_canonical at /usr/src/debug/mono-5.0.0/mono/sgen/sgen-marksweep-drain-gray-stack.h:156
Comment 38 Rodrigo Kumpera 2017-05-22 22:20:51 UTC
Hi Ben,

We fixed the crash from comment #32 in mono 5.2.
Comment 39 Ben Burns 2017-05-23 02:56:00 UTC
Thanks, Rodrigo.

I've spent some time today digging into our own code and the various versions of mono in play.

I should say straight off that we do have a native component which we're calling via P/Invoke, and I haven't yet ruled out memory corruption. That said, we haven't observed other indicators of this being the problem, so it's low on my list of priorities.

Assuming it's not memory corruption, the assertion failure occurs during major collection when scanning objects in the gray object queue. This can only happen because something calls GRAY_OBJECT_ENQUEUE with an invalid SgenDescriptor. I haven't yet been able to get it to dump the value of said descriptor, but I'm guessing it's 0.

Biggest problem for me is that out of ~96 instances of my service this crops up around once every 3-5 days, and I've been very unsuccessful at getting it to repro locally on the same docker image. I've gotten in touch with my colleague who saw the same assertion failure on 5.0.0 and he's going to get me set up w/ his code, as that seems to repro much more quickly (order of hours instead of days). Once I can get GDB connected up to it, I should have a much better feel for what's going on.
Comment 40 Ben Burns 2017-05-24 00:29:12 UTC
I managed to repro the assertion failure in the official mono:5.0.0.100 docker image.

Thread 83 (Thread 0x7f6737fff700 (LWP 231)):
#0  0x00007f67398b6489 in __libc_waitpid (pid=pid@entry=361, stat_loc=stat_loc@entry=0x7f6737ffc7fc, options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:40
#1  0x00000000004ada49 in mono_handle_native_crash (signal=<optimized out>, signal@entry=0x69abd3 "SIGSEGV", ctx=ctx@entry=0x7f6737ffd280, info=info@entry=0x7f6737ffd3b0) at mini-exceptions.c:2567
#2  0x000000000042671c in mono_sigsegv_signal_handler (_dummy=11, _info=0x7f6737ffd3b0, context=0x7f6737ffd280) at mini-runtime.c:2821
#3  <signal handler called>
#4  __GI_abort () at abort.c:125
#5  0x00000000004adac9 in mono_handle_native_crash (signal=<optimized out>, ctx=<optimized out>, info=<optimized out>) at mini-exceptions.c:2615
#6  <signal handler called>
#7  0x00007f673931b067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#8  0x00007f673931c448 in __GI_abort () at abort.c:89
#9  0x000000000067ad49 in mono_log_write_logfile (log_domain=0x1 <error: Cannot access memory at address 0x1>, level=G_LOG_LEVEL_ERROR, hdr=6, message=0x7f673933199a <_IO_vfprintf_internal+22490> "\200\275(\373\377\377") at mono-log-common.c:137
#10 0x000000000068ff3d in monoeg_g_logv (log_domain=log_domain@entry=0x0, log_level=log_level@entry=G_LOG_LEVEL_ERROR, format=format@entry=0x6997b0 "* Assertion: should not be reached at %s:%d\n", args=args@entry=0x7f6737ffeca0) at goutput.c:115
#11 0x00000000006900d3 in monoeg_assertion_message (format=format@entry=0x6997b0 "* Assertion: should not be reached at %s:%d\n") at goutput.c:135
#12 0x000000000065e2b8 in major_scan_object_concurrent_with_evacuation (full_object=0x7f6700000000, desc=<optimized out>, queue=queue@entry=0x7f673ab13010) at sgen-scan-object.h:90
#13 0x000000000065f38c in drain_gray_stack_concurrent_with_evacuation (queue=<optimized out>) at sgen-marksweep-drain-gray-stack.h:339
#14 drain_gray_stack_concurrent (queue=0x7f673ab13010) at sgen-marksweep.c:1321
#15 0x0000000000672151 in marker_idle_func (data_untyped=0x7f673ab13008) at sgen-workers.c:328
#16 0x000000000067134f in thread_func (thread_data=0x7f673ab13008) at sgen-thread-pool.c:151
#17 0x00007f67398af064 in start_thread (arg=0x7f6737fff700) at pthread_create.c:309
#18 0x00007f67393ce62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


It appears to be as I said before: a bad SgenDescriptor in the gray object queue. For giggles I'll switch back to the non-concurrent marksweep and see if I still observe this issue. Otherwise I'm a bit pressed for time on other things for the remainder of the week, so I don't expect I'll have much chance of actually getting it up and running in the debugger. If I do, I'll chuck a conditional breakpoint on the enqueue and hopefully I'll be able to see a bit more of what's leading up to this.
Comment 41 Ben Burns 2017-05-24 00:39:36 UTC
This is likely a red herring, but just in case it helps -- yesterday while trying to trace through the source of the bad SgenDescriptor, I ran across par_copy_object_no_checks and copy_object_no_checks in sgen-copy-object.h.

If the vtable is potentially not valid, perhaps it is unsafe to be using it to fetch the SgenDescriptor when adding the object to the gray object queue? I realize there's an assert in par_copy_object_no_checks guarding that the vtable has a descriptor at all, but is it possible that this descriptor has been added to the vtable but not yet had its state initialized?
Comment 42 Vlad Brezae 2017-05-26 00:21:31 UTC
Hey Ben,

Using the Parallel.For appdomain creation test case on a docker image, I was able to reproduce the crash in alloc_context_static_data_helper, but not the sgen crash yet. Did you reproduce the sgen crash using the same test case?
Comment 43 Ben Burns 2017-05-26 01:04:45 UTC
Hi Vlad,

No, I wasn't able to repro the Sgen assert crash using the repro above. Apologies for the confusion on that.

I've gotten the sgen crash to repro fairly readily (down to a few hours from several days), on mono 5.0.0.100, but not using code that I can share.

My running theory is that when an object is pushed onto the gray object queue desc & DESC_TYPE_MASK is 0 (the only invalid value AFAIK). I tried adding a conditional breakpoint to demonstrate this, but strangely gdb tells me that both desc and obj were optimized out from sgen_gray_object_enqueue in sgen-gray.c.
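
To illustrate the theory: if the scanner dispatches on the descriptor's type bits and every known type has its own case, then type bits of 0 are the only input that falls through to the "should not be reached" branch. The sketch below is a standalone model, not sgen's code. The mask value 0x7, the assumption that values 1 through 7 are all valid, and the demo_ names are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: the low 3 bits of a descriptor select the scan strategy. */
#define DEMO_DESC_TYPE_MASK 0x7u

typedef uintptr_t demo_descriptor;

/* Returns 1 if the type bits name a known scan strategy, 0 otherwise. */
int
demo_descriptor_is_scannable (demo_descriptor desc)
{
    switch (desc & DEMO_DESC_TYPE_MASK) {
    case 1: case 2: case 3: case 4:
    case 5: case 6: case 7:
        return 1;   /* a real scanner would walk the object's references here */
    default:
        /* Only type bits == 0 land here; this models the branch that
         * asserts "should not be reached". */
        return 0;
    }
}
```

Note that a descriptor such as 0x10 is non-zero yet still has type bits of 0, so a check against the whole word being zero would miss it.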

TBH I'm not very familiar w/ garbage collection in general, let alone sgen, and I'm only passingly familiar w/ compiler optimizations, so I'm not very sure where to go from here. My assumption is that if the compiler can safely optimize out desc from the signature of sgen_gray_object_enqueue that it's either because of some ABI-level optimisation which I'm very unfamiliar with, or because sgen_gray_object_enqueue is only ever called with one value. Either way, I'm not sure why *obj would be optimized out. From what I can see of code paths leading to sgen_gray_object_enqueue, the SgenDescriptor comes from the object's vtable. If the object initialization silently failed in some way, I could imagine that the descriptor type wouldn't have been set, leaving the default value of 0 (assuming zeroed initial values of course).

But that's a long stretch based on a few too many assumptions, so I'm not trusting any of it until I can verify those assumptions more. That said, hopefully some of this will help you all to spot what might be happening here.

In the meantime, as a coarse measure, I see that there's a SGEN_CHECK_GRAY_OBJECT_ENQUEUE macro - I may try building with that enabled and see what it tells me. If that turns up nothing, I'll try getting a build together w/o optimizations, but if this is rooted in the compiler I'm a bit worried that the problem will disappear in that case.
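
The general idea behind checking at enqueue time, rather than at scan time, is that the failure is then reported on the stack of whoever produced the bad entry. What SGEN_CHECK_GRAY_OBJECT_ENQUEUE actually verifies is not established in this thread, so the sketch below is only a generic model of the technique, with a hypothetical fixed-size queue, hypothetical demo_ names, and the same assumed 0x7 type mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_DESC_TYPE_MASK 0x7u   /* assumed: low 3 bits hold the type */
#define DEMO_QUEUE_CAPACITY 64

typedef struct {
    void *obj;
    uintptr_t desc;
} demo_gray_entry;

typedef struct {
    demo_gray_entry entries[DEMO_QUEUE_CAPACITY];
    size_t count;
} demo_gray_queue;

/* Enqueues (obj, desc) only if the descriptor has non-zero type bits.
 * Returns 1 if enqueued, 0 if rejected (bad descriptor or queue full). */
int
demo_gray_enqueue_checked (demo_gray_queue *queue, void *obj, uintptr_t desc)
{
    assert (queue != NULL);
    if ((desc & DEMO_DESC_TYPE_MASK) == 0)
        return 0;   /* log or break here to catch the producer in the act */
    if (queue->count == DEMO_QUEUE_CAPACITY)
        return 0;
    queue->entries[queue->count].obj = obj;
    queue->entries[queue->count].desc = desc;
    queue->count++;
    return 1;
}
```

Because the check runs inside the function body, it also does not depend on gdb being able to read arguments that the compiler has optimized out at the call boundary.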

Either way, thanks in advance for any help you can provide. I realize you're going to be flying a bit blind until I can share something to repro this.
Comment 44 Vlad Brezae 2017-05-30 22:54:56 UTC
Fixed the crash when appdomains are created in parallel (not an sgen crash). Closing this issue.

@Ben sgen-scan-object.h is just a general place where the gc crashes when references are not pointing where they should be. Your issue is very likely unrelated to the other similar crashes on bugzilla. This can happen for many reasons and it is not really possible to debug from the trace. Typically, in order to debug these crashes, we need a repro so we can reproduce the issue with a recompiled mono that contains heavy logging. Please submit a new bug with some repro steps (getting an NDA signed would be an option) or contact me directly by mail or gitter to see if we can set up some debugging.
Comment 45 Ben Burns 2017-05-31 00:34:09 UTC
For anyone finding this later on and thinking https://xkcd.com/979/ sort of thoughts, I've reached out to Vlad directly to see how we might debug this. If I raise a new issue (I likely will) I'll add a reference to it here.
